#Start cleaning the environment
rm(list = ls())
#Set the working directory
setwd("C:/Users/manuu/OneDrive/Escritorio/UC3M/Cuatri 1 Bimestre 2/Predictive Models/Trabajo 1")Communities and Crime Analysis
Predictive Modeling | Master in Big Data Analytics
1 Introduction
Crime affects communities in diverse ways, influenced by socio-economic factors, demographics, and law enforcement practices. Understanding these relationships is crucial for effective policy-making and resource allocation.
In this study, we analyze the Communities and Crime Unnormalized Data Set from the UCI Machine Learning Repository. This dataset combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 LEMAS survey, and crime data from the 1995 FBI UCR.
1.1 The Dataset
We will work with a rich dataset containing information on 2,215 communities in the United States. The data includes variables related to:
Identification variables: community identifiers (name, state and geographic codes) without predictive content.
Socio‑economic variables: income, poverty, welfare and retirement income, and employment rates
Demographic variables: age distribution, racial and ethnic composition, migration and foreign‑born shares.
Housing variables: crowding, dwelling size, vacancies, housing quality, rents and home values.
Family structure variables: household size, single‑parent and two‑parent families, divorce and marital status.
Police variables: sworn officers, field operations, specialised units and drug enforcement resources.
Crime variables: offence counts, per‑capita crime rates and aggregated violent and non‑violent crime rates.
We start by importing the dataset to R:
#Read the dataset
# na.strings=" ?" is done to get all the interrogations as NA's
datos <- read.table("CommViolPredUnnormalizedData.txt",
sep = ",",
na.strings = c("?", " ", ""),
header = FALSE,
stringsAsFactors = FALSE)Different libraries will be used throught the project, so we will import all of them right now:
library(tidyverse)
library(tidymodels)
library(janitor)
library(knitr)
library(corrplot)
library(naniar)
library(dplyr)
library(ggplot2)
library(moments)
library(maps)
library(viridis)
library(patchwork)
library(usmap)
library(stringr)
library(ggcorrplot)
library(tidyr)
library(MASS)
library(broom)
library(caret)
library(car)
library(lmtest)
library(rsample)
library(yardstick)
library(purrr)
library(car)
library(FactoMineR)
library(factoextra)
library(pROC)
library(ggnewscale)The dataset has 147 variables. Dealing with V1, V2… is not feasible for interpretation, so we name each of them in terms of the original dataset’s names (obtained directly from the same page as the dataset):
# Assignment of column names based on UCI documentation
column_names <- c(
"communityname", "state", "countyCode", "communityCode", "fold",
"population", "householdsize", "racepctblack", "racePctWhite", "racePctAsian",
"racePctHisp", "agePct12t21", "agePct12t29", "agePct16t24", "agePct65up",
"numbUrban", "pctUrban", "medIncome", "pctWWage", "pctWFarmSelf", "pctWInvInc",
"pctWSocSec", "pctWPubAsst", "pctWRetire", "medFamInc", "perCapInc",
"whitePerCap", "blackPerCap", "indianPerCap", "AsianPerCap", "OtherPerCap",
"HispPerCap", "NumUnderPov", "PctPopUnderPov", "PctLess9thGrade", "PctNotHSGrad",
"PctBSorMore", "PctUnemployed", "PctEmploy", "PctEmplManu", "PctEmplProfServ",
"PctOccupManu", "PctOccupMgmtProf", "MalePctDivorce", "MalePctNevMarr",
"FemalePctDiv", "TotalPctDiv", "PersPerFam", "PctFam2Par", "PctKids2Par",
"PctYoungKids2Par", "PctTeen2Par", "PctWorkMomYoungKids", "PctWorkMom",
"NumIlleg", "PctIlleg", "NumImmig", "PctImmigRecent", "PctImmigRec5",
"PctImmigRec8", "PctImmigRec10", "PctRecentImmig", "PctRecImmig5",
"PctRecImmig8", "PctRecImmig10", "PctSpeakEnglOnly", "PctNotSpeakEnglWell",
"PctLargHouseFam", "PctLargHouseOccup", "PersPerOccupHous", "PersPerOwnOccHous",
"PersPerRentOccHous", "PctPersOwnOccup", "PctPersDenseHous", "PctHousLess3BR",
"MedNumBR", "HousVacant", "PctHousOccup", "PctHousOwnOcc", "PctVacantBoarded",
"PctVacMore6Mos", "MedYrHousBuilt", "PctHousNoPhone", "PctWOFullPlumb",
"OwnOccLowQuart", "OwnOccMedVal", "OwnOccHiQuart", "OwnOccQrange",
"RentLowQ", "RentMedian", "RentHighQ", "RentQrange", "MedRent",
"MedRentPctHousInc", "MedOwnCostPctInc", "MedOwnCostPctIncNoMtg",
"NumInShelters", "NumStreet", "PctForeignBorn", "PctBornSameState",
"PctSameHouse85", "PctSameCity85", "PctSameState85", "LemasSwornFT",
"LemasSwFTPerPop", "LemasSwFTFieldOps", "LemasSwFTFieldPerPop",
"LemasTotalReq", "LemasTotReqPerPop", "PolicReqPerOffic", "PolicPerPop",
"RacialMatchCommPol", "PctPolicWhite", "PctPolicBlack", "PctPolicHisp",
"PctPolicAsian", "PctPolicMinor", "OfficAssgnDrugUnits", "NumKindsDrugsSeiz",
"PolicAveOTWorked", "LandArea", "PopDens", "PctUsePubTrans", "PolicCars",
"PolicOperBudg", "LemasPctPolicOnPatr", "LemasGangUnitDeploy",
"LemasPctOfficDrugUn", "PolicBudgPerPop",
"murders", "murdPerPop", "rapes", "rapesPerPop", "robberies", "robbbPerPop",
"assaults", "assaultPerPop", "burglaries", "burglPerPop", "larcenies",
"larcPerPop", "autoTheft", "autoTheftPerPop", "arsons", "arsonsPerPop",
"ViolentCrimesPerPop", "nonViolPerPop"
)
# Check dimensions
cat("Columns in file: ", ncol(datos), "\n")Columns in file: 147
cat("Names in vector: ", length(column_names), "\n")Names in vector: 147
Some names might result confusing, and given R’s Case Sensitivity, we can have some difficulties in this project. This problem caused the innitiative of standarising the variable’s names:
# Assign names and standardize to snake_case
colnames(datos) = column_names
datos = datos %>% clean_names()1.2 The goal
The goal of this project is to build interpretable predictive models of community crime levels using the Communities and Crime dataset. Specifically, we aim to predict violent and non‑violent crime rates per 100,000 inhabitants from socio‑economic, demographic, housing, family structure and policing characteristics, after careful data cleaning and removal of potential sources of data leakage.
So, the target variables for this study are:
# Target description table
target_table <- tibble(
Target_variable = c("violent_crimes_per_pop", "nonviol_per_pop"),
Description = c(
"Violent crime rate per 100K inhabitants.",
"Non-violent crime rate per 100K inhabitants."
)
)
target_table |>
kable(caption = "Target variables")| Target_variable | Description |
|---|---|
| violent_crimes_per_pop | Violent crime rate per 100K inhabitants. |
| nonviol_per_pop | Non-violent crime rate per 100K inhabitants. |
In both cases, the variables are aggregated: each represents the total sum of multiple crime types rather than a single offense. We use these aggregated measures because our goal is to predict the overall crime level in each community, not to distinguish between specific crime types. This approach simplifies the modeling problem and yields more stable results.
- violent_crimes_per_pop = (murders + rapes + robberies + assaults) per 100,000 inhabitants.
- nonviol_per_pop = (burglaries + larcenies + auto thefts + arsons) per 100,000 inhabitants.
1.3 Limitations of the study
Before starting with the study, let’s have a reality check and see what we can expect from this project:
Temporal Gap: The 5-year gap between census data (1990) and crime statistics (1995) assumes community characteristics remained stable during that period.
Critical unmeasured factors: The following factors might be key for explaining crime, but are not included in the dataset. Thus, the quality of our predictions might be worse without these crime indicators
Drug Markets: Illegal drug activity is a major driver of violent crime (turf wars, enforcement conflicts) but is unobserved.
Social Capital: Trust, faith, and informal networks are powerful crime deterrents but cannot be measured in Census data.
Psicology: Broken Windows policing, community policing, and hot-spot policing create vastly different crime outcomes, but policing philosophy is unobserved.
Historical Context: Communities with histories of industrial decline, racial violence, or natural disasters carry trauma that affects crime rates beyond current socioeconomic indicators.
Neighborhood Culture: Norms around violence, “code of the street” dynamics, and tolerance for disorder vary dramatically but are unquantifiable.
1.4 Complexity of the study
Crime and socioeconomic conditions are locked in reciprocal causation:
High Crime -> Businesses Leave -> Unemployment Rises -> Poverty Increases -> More Crime -> Further Business Exit -> …
This creates simultaneity bias: our predictors (poverty, unemployment) are not exogenous; they are themselves influenced by crime. Standard OLS regression assumes predictors are independent of the error term, but this assumption is violated when feedback loops exist.
Our coefficient estimates are then biased. The “effect” of poverty on crime is confounded by the “effect” of crime on poverty. Separating these requires instrumental variables, natural experiments, or longitudinal panel data—none of which are available in this cross-sectional dataset.
We observe correlations, but cannot establish causal direction.
Example 1. Poverty and Crime: - Does poverty cause crime? (Economic strain -> criminal motivation) - Does crime cause poverty? (High crime -> businesses leave -> unemployment rises -> poverty increases)
Example 2. Police Presence: - Do more police reduce crime? (Deterrence effect) - Does more crime require more police? (Reactive deployment)
Our models identify predictors, not interventions. A policymaker cannot simply “reduce family instability” by 10 units and expect crime to drop proportionally, family structure is itself the outcome of deeper economic and cultural forces.
2 Data Preprocessing
Before fitting any predictive model, we need to carefully clean and structure the data. In this stage, we remove non-informative identifiers, exclude variables that would cause data leakage (crime counts and crime rates that are derived from the same outcomes we want to predict), and keep only predictors that could be realistically observed before the crime happens. This ensures that our models are interpretable and that their performance is not artificially inflated by using future information.
2.1 Identification of variables to remove
2.1.1 Identifier Variables
These variables encode administrative identifiers that do not contain predictive behavioral, social, or demographic information. They cannot help forecast crime levels and instead introduce unnecessary noise.
id_vars <- c(
"communityname",
# Name of the community. Text identifier with no predictive value, pure label.
#"state",
# US state code. Geographic identifier, too coarse and creates data leakage (state-level effects).
"county_code",
# County numeric code. Administrative identifier with no interpretable crime relationship.
"community_code",
# Community numeric code. Administrative identifier, arbitrary numbering system.
"fold"
# Cross-validation fold assignment. Technical variable for model evaluation, not a predictor.
)Even though the variable state will not give any information related to prediction, we will incorporate it in the dataset in order to make some geographic plots (in the section Geographic Distribution)
2.1.2 Crime Variables
These variables directly measure crime counts or rates that belong to the same family of outcomes we want to predict. Keeping them would introduce severe data leakage because they contain future information about the target.
crime_vars <- c(
"murders",
# Total number of murders. Raw count used to calculate violent_crimes_per_pop
"rapes",
# Total number of rapes. Raw count used to calculate violent_crimes_per_pop
"robberies",
# Total number of robberies. Raw count used to calculate violent_crimes_per_pop
"assaults",
# Total number of assaults. Raw count used to calculate violent_crimes_per_pop
"burglaries",
# Total number of burglaries. Raw count used to calculate non_viol_per_pop
"larcenies",
# Total number of larcenies. Raw count used to calculate non_viol_per_pop
"auto_theft",
# Total number of auto thefts. Raw count used to calculate non_viol_per_pop
"arsons",
# Total number of arsons. Raw count used to calculate non_viol_per_pop
"murd_per_pop",
# Murder rate per 100K. Component of violent_crimes_per_pop target
"rapes_per_pop",
# Rape rate per 100K. Component of violent_crimes_per_pop target
"robbb_per_pop",
# Robbery rate per 100K. Component of violent_crimes_per_pop target
"assault_per_pop",
# Assault rate per 100K. Component of violent_crimes_per_pop target
"burgl_per_pop",
# Burglary rate per 100K. Component of non_viol_per_pop target
"larc_per_pop",
# Larceny rate per 100K. Component of non_viol_per_pop target
"auto_theft_per_pop",
# Auto theft rate per 100K. Component of non_viol_per_pop target
"arsons_per_pop"
# Arson rate per 100K. Component of non_viol_per_pop target
)2.1.3 Police / LEMAS Variables
This variables come from LEMAS surveys. As these surveys change across states and years, the results obtained are not even comparable. Furthermore, these surveys do not usually take place in smaller populations, so these have extremely high missingness. Including themo would only add noise, reduce model stability, and provide weak predictive signal.
police_lemas_vars <- c(
"lemas_sworn_ft",
# Number of sworn full-time officers. Raw count, highly missing, selection bias.
"lemas_sw_ft_per_pop",
# Sworn officers per population. Highly missing, unreliable despite normalization.
"lemas_sw_ft_field_ops",
# Field operations officers. Subset of total, no added value, highly missing.
"lemas_sw_ft_field_per_pop",
# Field officers per population. Redundant with total officers, highly missing.
"lemas_total_req",
# Total police service requests. Raw count, highly missing, introduces circularity.
"lemas_tot_req_per_pop",
# Requests per population. Highly missing, may reflect crime (endogeneity).
"polic_req_per_offic",
# Requests per officer (workload measure). Highly missing, too sparse to be reliable.
"polic_per_pop",
# Officers per population. Similar to lemas_sw_ft_per_pop, highly missing.
"polic_ave_ot_worked",
# Average overtime worked. Highly missing, endogenous to crime levels.
"racial_match_comm_pol",
# Racial match between community and police. Highly missing, mixed research support.
"pct_polic_white",
# Police force racial composition (white). Highly missing, redundant with community demographics.
"pct_polic_black",
# Police force racial composition (black). Highly missing, redundant with community demographics.
"pct_polic_hisp",
# Police force racial composition (Hispanic). Highly missing, redundant with community demographics.
"pct_polic_asian",
# Police force racial composition (Asian). Highly missing, redundant with community demographics.
"pct_polic_minor",
# Police force racial composition (minority). Highly missing, composite of individual measures.
"offic_assgn_drug_units",
# Officers assigned to drug units. Highly missing, potentially endogenous (assigned due to crime).
"num_kinds_drugs_seiz",
# Number of drug types seized. Highly missing, result of enforcement not predictor.
"lemas_gang_unit_deploy",
# Gang unit deployment indicator. Highly missing, likely endogenous (deployed in response to crime).
"lemas_pct_offic_drug_un",
# Proportion of officers in drug units. Highly missing, may respond to crime rather than predict it.
"polic_cars",
# Number of police cars. Raw count, highly missing, unclear relationship to crime rates.
"polic_oper_budg",
# Police operating budget. Raw dollars not adjusted, highly missing, larger cities have bigger budgets.
"lemas_pct_polic_on_patr",
# Proportion of officers on patrol. Highly missing, operational allocation with unclear predictive value.
"polic_budg_per_pop"
# Police budget per capita. Highly missing, could reflect crime response or just wealth.
)2.1.4 Geographic / Administrative Variables
These variables describe coarse administrative or geographic characteristics of the community. They typically exhibit very low variance across the sample, or provide redundant information that is already captured by more granular socioeconomic variables (population density, housing structure, mobility). Their predictive value would be minimal.
geo_admin_vars <- c(
"numb_urban",
# Number of urban areas. Administrative count with no clear crime relationship, redundant with pct_urban.
"land_area",
# Total land area in square miles. Size without population context, replaced by pop_dens.
"pct_urban"
# Proportion living in urban areas. Highly correlated with pop_dens, which is more direct.
)In our dataset, we already have a variable called pop_dens(population density = people per square mile), which is a direct measure of urbanization capturing concentration, crime opportunity, and anonymity.
2.1.5 Immigration Variables
Multiple overlapping windows describing the same phenomenon (recent immigration). They introduce multicollinearity and add redundant structure with little unique information, so only the most informative ones are kept.
immig_vars <- c(
"pct_immig_rec5",
# Immigrants arriving in last 5 years. Too short window, replaced by pct_immig_rec10.
"pct_rec_immig5",
# Alternative 5-year immigration coding. Duplicate information, replaced by pct_rec_immig10.
"pct_rec_immig8",
# 8-year immigration window. Middle ground with no advantage, replaced by pct_rec_immig10.
"num_immig",
# Absolute number of immigrants. Raw count not comparable, replaced by pct_foreign_born.
"num_illeg",
# Absolute number of children born to unmarried parents. Raw count, replaced by pct_illeg.
"pct_rec_immig10"
# Percentage of inmigrants that came in the last 10 years. Replaced by pct_inmig_rec10
)Variables kept in the dataset:
pct_foreign_born: Overall foreign-born population, stable long-term immigrant composition measure.
pct_immig_recent: Recent immigration activity, captures migration flows and community disruption.
pct_immig_rec10: 10-year immigration window, more stable than 5-year.
pct_illeg: Percentage of children born to unmarried parents (family structure measure)
2.1.6 Age Variables
The dataset includes 4 overlapping age range variables. We keep only the most informative non-overlapping ones.
age_vars <- c(
"age_pct12t21",
# Population aged 12-21. Completely nested within age_pct12t29, creates multicollinearity.
"age_pct16t24"
# Population aged 16-24. Heavily overlaps with age_pct12t29, adds redundant information.
)Age variables kept:
age_pct12t29: Proportion of population aged 12-29.
age_pct65up: Proportion of population aged 65 and older.
2.1.7 Housing & Household Structure Variables
Many are percentiles, medians, or counts that duplicate information from cleaner socioeconomic variables. Frequently skewed or low-variance, so they are removed during preprocessing.
housing_vars <- c(
"rent_low_q",
# Lower quartile of rent. Redundant with rent_median and rent_qrange.
"rent_high_q",
# Upper quartile of rent. Redundant with rent_median and rent_qrange.
"med_rent",
# Median gross rent. Exact duplicate of rent_median variable, keeping only rent_median.
"pers_per_own_occ_hous",
# Persons per owner-occupied household. Owners rarely crowded, little predictive value.
"pers_per_rent_occ_hous"
# Persons per renter-occupied household. Redundant with general crowding measure.
)Variables kept:
own_occ_low_quart: Lower quartile of home values, captures poverty concentration.
own_occ_med_val: Median home value, stable measure of community wealth.
own_occ_hi_quart: Upper quartile of home values, captures wealth concentration.
own_occ_qrange: Interquartile range of home values, measures economic inequality. Might result redundant with the Q1, Q3 variables, but it gives key information about the economy of a community
rent_median: Median gross rent, typical housing costs for renters.
rent_qrange: Interquartile range of rent, measures rent inequality.
med_rent_pct_hous_inc: Rent burden, proportion of income spent on rent.
med_own_cost_pct_inc: Ownership cost burden, proportion of income on mortgage and taxes.
pers_per_occup_hous: Persons per household, overall crowding measure.
2.1.8 Homeless Population Variables
These variables often contain zeros or missing values, and may not be reliably collected across all communities.
homeless_vars <- c(
"num_in_shelters",
# People in homeless shelters. Absolute count not comparable, often unreported in small communities.
"num_street"
# Homeless people on streets. Even less reliable than shelters, point-in-time counts are poor.
)2.1.9 Advanced Dimensionality Reduction
At this point, strict data cleaning has been performed, removing identifiers, leakage variables, and sparse data. However, the dataset remains high-dimensional (>80 variables) with significant multicollinearity (e.g., multiple variables measuring similar aspects of income, family structure, or housing). So in this section, the focus will be reducing drastically the number of variables.
Feature Engineering: Creating a composite family_risk_index to capture family instability in a single metric.
During our initial analysis, we identified that the dataset contains multiple highly correlated variables related to family structure (e.g., divorce rates, percentage of children born to unmarried parents, and percentage of single-parent households). Including all of them would introduce severe multicollinearity without adding distinct information. We use these three variables:
pct_illeg: Percentage of kids born to unmarried parents.
total_pct_div: Divorce rate.
Inverted pct_kids2par: Since pct_kids2par (percentage of kids in two-parent households) represents stability, we calculate 100 - pct_kids2par to align its direction with the other two variables (representing instability).
This engineered feature allows us to capture the predictive signal of family structure in a single, robust metric.
datos <- datos %>%
mutate(
# Average of Illegitimacy, Divorce, and Single-Parenthood (inverted)
family_risk_index = (pct_illeg + total_pct_div + (100 - pct_kids2par)) / 3
)Redundancy Removal: To reduce multicollinearity and overfitting, we define a list of variables for a secondary, aggressive reduction phase based on the following criteria:
A. Redundant Index Components: Since we constructed the family_risk_index, the individual components (such as pct_illeg and total_pct_div) and their proxies are now mathematically redundant. We remove them to avoid collinearity with our new engineered feature.
B. Race-Specific Income: The dataset includes per capita income for every racial group (e.g., white_per_cap, black_per_cap). These are unnecessary because the model can essentially infer these socioeconomic dynamics using the general med_income variable combined with the community’s racial demographics.
C. Granular Employment: Variables detailing specific employment sectors (e.g., manufacturing vs. professional services) introduce noise without adding clear predictive signal. We prioritize overall economic health indicators like unemployment rates and median income over specific industry types.
D. Housing Redundancies: Quartiles and ranges for rent and housing values (e.g., rent_low_q) are highly correlated with the median. We retain only the median measures as they efficiently capture the central tendency of housing costs without duplication.
E. Low Variance / Obsolete: Variables such as lack of phone service (pct_hous_no_phone) or lack of plumbing exhibit near-zero variance in modern data. These are obsolete indicators that provide no discriminative power between communities.
advanced_reduction_vars <- c(
# A. Components of the new Index (now redundant)
"pct_illeg", "total_pct_div", "pct_kids2par",
"pct_fam2par", "pct_young_kids2par", "pct_teen2par",
"male_pct_divorce", "female_pct_div", "male_pct_nev_marr",
"pct_work_mom", "pct_work_mom_young_kids", "num_illeg",
# B. Race-Specific Income (Redundant with med_income + race percentages)
"white_per_cap", "black_per_cap", "indian_per_cap",
"asian_per_cap", "other_per_cap", "hisp_per_cap", "med_fam_inc",
# C. Granular Employment/Financial Details (Noise)
"pct_employ", "pct_empl_manu", "pct_empl_prof_serv",
"pct_occup_manu", "pct_occup_mgmt_prof",
"pct_w_wage", "pct_w_farm_self", "pct_w_inv_inc",
"pct_w_soc_sec", "pct_w_retire", "pct_w_pub_asst",
# D. Housing Redundancies (Quartiles & Ranges)
"own_occ_low_quart", "own_occ_hi_quart", "own_occ_qrange",
"rent_low_q", "rent_high_q", "rent_qrange",
"med_rent_pct_hous_inc", "med_own_cost_pct_inc", "med_own_cost_pct_inc_no_mtg",
# E. Low Variance / Obsolete
"pct_hous_no_phone", "pct_wo_full_plumb", "med_yr_hous_built",
"pct_born_same_state", "pct_same_house85", "pct_same_city85", "pct_same_state85"
)2.2 Missing Values Analysis
Before applying the final cleaning, we identify the percentage of missing values that each variable has:
# Combine preliminary variables to remove
preliminary_vars_to_remove <- c(
id_vars,
crime_vars,
police_lemas_vars,
geo_admin_vars,
immig_vars,
age_vars,
housing_vars,
homeless_vars,
advanced_reduction_vars
)
preliminary_vars_to_remove <- unique(preliminary_vars_to_remove)
# Create preliminary cleaned dataset
datos_clean <- datos %>%
dplyr::select(-all_of(preliminary_vars_to_remove))Does any of the final variables contain a big percentage of NA values?
# Calculate percentage of NAs for remaining variables
na_percentage <- datos_clean %>%
summarise(across(everything(), ~ mean(is.na(.)) * 100))
na_percentagehigh_na_threshold <- 20
# Filter and show only variables with >20% NAs
high_na_results <- na_percentage %>%
dplyr::select(where(~ . > high_na_threshold))
cat("\nNumber of variables exceeding threshold:", ncol(high_na_results), "\n")
Number of variables exceeding threshold: 0
So no variables are eliminated using this criteria.
2.2.1 Missing values in Target Variables
We can also see that there are some missing values in the target variables, so we do not have other choice but to eliminate those instances which have NA in the target variables:
# Check missing values in targets
cat(" violent_crimes_per_pop:", sum(is.na(datos_clean$violent_crimes_per_pop)),
sprintf("(%.2f%%)\n", mean(is.na(datos_clean$violent_crimes_per_pop)) * 100)) violent_crimes_per_pop: 221 (9.98%)
cat(" non_viol_per_pop:", sum(is.na(datos_clean$non_viol_per_pop)),
sprintf("(%.2f%%)\n", mean(is.na(datos_clean$non_viol_per_pop)) * 100)) non_viol_per_pop: 97 (4.38%)
# Remove observations with missing target values
datos_clean <- datos_clean %>%
filter(!is.na(violent_crimes_per_pop) & !is.na(non_viol_per_pop))
cat("\nObservations after removing missing targets:", nrow(datos_clean), "\n")
Observations after removing missing targets: 1902
cat("Observations removed:", nrow(datos) - nrow(datos_clean), "\n")Observations removed: 313
2.2.2 Missing values in predictive variables
After removing observations with missing targets, we check if any predictor variables still contain missing values that need to be addressed.
# Remove target variables to focus only on predictors
predictors_only <- datos_clean %>%
dplyr::select(-violent_crimes_per_pop, -non_viol_per_pop)
# Calculate missing values in predictors
predictor_na_summary <- predictors_only %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
mutate(pct_missing = n_missing / nrow(datos_clean) * 100) %>%
filter(n_missing > 0) %>%
arrange(desc(n_missing))
if(nrow(predictor_na_summary) > 0) {
print(predictor_na_summary)
} else {
cat("No missing values in any predictor variable.\n")
}No missing values in any predictor variable.
2.3 Data Cleaning
After handling all missing values,we have successfully cleaned the dataset by removing all previously identified problematic and unnecessary variables.
cat("Original dataset:", ncol(datos), "variables and", nrow(datos), "observations \n")Original dataset: 148 variables and 2215 observations
cat("Cleaned dataset:", ncol(datos_clean), "variables and", nrow(datos_clean), "observations \n")Cleaned dataset: 44 variables and 1902 observations
We eliminated more than 100 variables at this point, but a model with 40+ variables is still too complex. So we will do a final cleaning block.
2.3.1 Final Feature Selection
Three different techniques will be used for this final dimensionality reduction.
2.3.1.1 Multicollinearity Analysis
numeric_predictors <- datos_clean %>%
dplyr::select(-violent_crimes_per_pop, -non_viol_per_pop) %>%
dplyr::select(where(is.numeric))
# Calculate Correlation Matrix
cor_matrix <- cor(numeric_predictors, use = "pairwise.complete.obs")
# Find high correlations
f <- function(mat, threshold = 0.9) {
mat[upper.tri(mat, diag = TRUE)] <- NA
mat_df <- as.data.frame(as.table(mat))
mat_df <- mat_df %>%
filter(!is.na(Freq), abs(Freq) > threshold) %>%
arrange(desc(abs(Freq)))
return(mat_df)
}
high_corr_pairs <- f(cor_matrix, threshold = 0.80)
kable(high_corr_pairs, caption = "Variable pairs with correlation > 0.80")| Var1 | Var2 | Freq |
|---|---|---|
| num_under_pov | population | 0.9891864 |
| pct_larg_house_occup | pct_larg_house_fam | 0.9864039 |
| pct_hous_own_occ | pct_pers_own_occup | 0.9822539 |
| pct_immig_rec10 | pct_immig_rec8 | 0.9479702 |
| pers_per_occup_hous | pers_per_fam | 0.9383811 |
| pct_not_hs_grad | pct_less9th_grade | 0.9315668 |
| pct_not_speak_engl_well | pct_speak_engl_only | -0.9283411 |
| hous_vacant | population | 0.9278613 |
| hous_vacant | num_under_pov | 0.9188135 |
| pct_foreign_born | pct_recent_immig | 0.9152083 |
| pct_speak_engl_only | race_pct_hisp | -0.9150018 |
| pct_larg_house_occup | pers_per_fam | 0.9031598 |
| pct_pers_dense_hous | pct_not_speak_engl_well | 0.8954340 |
| pct_not_speak_engl_well | race_pct_hisp | 0.8930820 |
| per_cap_inc | med_income | 0.8877202 |
| rent_median | own_occ_med_val | 0.8844716 |
| pct_larg_house_fam | pers_per_fam | 0.8812535 |
| pers_per_occup_hous | householdsize | 0.8789665 |
| pct_pers_dense_hous | pct_larg_house_fam | 0.8753514 |
| pct_pers_dense_hous | race_pct_hisp | 0.8658722 |
| pct_foreign_born | pct_speak_engl_only | -0.8645720 |
| pct_foreign_born | pct_not_speak_engl_well | 0.8596209 |
| rent_median | med_income | 0.8546328 |
| pct_pers_dense_hous | pct_larg_house_occup | 0.8481043 |
| pers_per_fam | householdsize | 0.8354050 |
| pct_immig_rec8 | pct_immig_recent | 0.8265228 |
| pers_per_occup_hous | pct_larg_house_occup | 0.8195686 |
| pct_pers_dense_hous | pct_speak_engl_only | -0.8178767 |
| own_occ_med_val | per_cap_inc | 0.8104073 |
| pct_not_speak_engl_well | pct_recent_immig | 0.8021484 |
Let’s eliminate part of the variables now, as there is very big correlation:
vars_high_corr_remove <- c(
### Absolute Counts (Redundant with population or rates)
"num_under_pov",
# Correlation 0.99 with population (we keep the rate: pct_pop_under_pov)
"hous_vacant",
# Correlation 0.93 with population
### Housing Redundancies (Measuring the same concept via different metrics)
"pct_larg_house_occup",
# Corr 0.99 with pct_larg_house_fam (Family size is more relevant for crime)
"pct_pers_own_occup",
# Corr 0.98 with pct_hous_own_occ (Persons vs Houses: duplicate info)
"pct_pers_dense_hous",
# Corr 0.89 with pct_not_speak_engl_well (Acts as a proxy for poverty/immigration)
"pct_hous_less3br",
# High negative corr with home ownership (Small houses = rentals)
"med_num_br",
# Redundant with house size
### Immigration & Language Redundancies
"pct_immig_rec8",
# Corr 0.95 with pct_immig_rec10 (Overlapping time windows)
"pct_immig_rec10",
# Corr 0.75 with pct_immig_recent (We keep 'recent' as the standard flow measure)
"pct_speak_engl_only",
# Corr -0.93 with pct_not_speak_engl_well (Perfect inverse redundancy)
"pct_recent_immig",
# Corr 0.91 with pct_foreign_born (High overlap, keeping the more stable 'foreign_born')
### Socioeconomic Redundancies
"per_cap_inc",
# Corr 0.89 with med_income (Median is more robust to outliers than per capita)
"own_occ_med_val",
# Corr 0.88 with rent_median (Home value and Rent are proxies for the same area wealth)
"pct_less9th_grade",
# Corr 0.93 with pct_not_hs_grad (Subset of 'Not High School Grad', which is the standard metric)
"pers_per_fam"
# Corr 0.94 with pers_per_occup_hous (Measures crowdiness; householdsize is preferred)
)
# Apply the removal to the dataset
datos_clean2 <- datos_clean %>%
dplyr::select(-any_of(vars_high_corr_remove))
cat("Variables removed due to multicollinearity:", length(vars_high_corr_remove), "\n")Variables removed due to multicollinearity: 15
cat("Final dataset dimensions:", dim(datos_clean2)[1], "observations and", dim(datos_clean2)[2], "variables.\n")Final dataset dimensions: 1902 observations and 29 variables.
2.3.2 Target Correlation
Now, the criteria for eliminating variables will be the following: we calculate the correlation of each remaining variable with the targets and retain only the strongest predictors. To decide which variables to discard, we evaluate the Mean Absolute Correlation of each variable across both target types. Variables with the lowest average correlation are considered the “weakest link”; as they provide the least explanatory power for the general crime phenomenon.
We proceed to identify and remove the bottom-performing variables:
# Calculate correlations with both targets
cor_viol <- datos_clean2 %>%
dplyr::select(-violent_crimes_per_pop, -non_viol_per_pop) %>%
dplyr::select(where(is.numeric)) %>%
map_dbl(~ cor(.x, datos_clean2$violent_crimes_per_pop, use = "complete.obs"))
cor_nonviol <- datos_clean2 %>%
dplyr::select(-violent_crimes_per_pop, -non_viol_per_pop) %>%
dplyr::select(where(is.numeric)) %>%
map_dbl(~ cor(.x, datos_clean2$non_viol_per_pop, use = "complete.obs"))
# Combine and calculate the mean impact
variable_ranking <- tibble(
variable = names(cor_viol),
corr_violent = cor_viol,
corr_nonviol = cor_nonviol
) %>%
mutate(
mean_impact = (abs(corr_violent) + abs(corr_nonviol)) / 2
) %>%
arrange(mean_impact) # Sort Ascending
kable(variable_ranking, digits = 3, caption = "Predictor Ranking. Weakest to Strongest Mean Impact")| variable | corr_violent | corr_nonviol | mean_impact |
|---|---|---|---|
| pct_vac_more6mos | 0.018 | -0.043 | 0.030 |
| race_pct_asian | 0.036 | -0.035 | 0.035 |
| age_pct65up | 0.054 | 0.127 | 0.090 |
| householdsize | -0.020 | -0.193 | 0.106 |
| age_pct12t29 | 0.110 | 0.111 | 0.111 |
| pers_per_occup_hous | -0.020 | -0.202 | 0.111 |
| pct_use_pub_trans | 0.190 | 0.032 | 0.111 |
| pct_foreign_born | 0.202 | 0.059 | 0.130 |
| pct_immig_recent | 0.145 | 0.175 | 0.160 |
| population | 0.212 | 0.119 | 0.166 |
| pop_dens | 0.262 | 0.088 | 0.175 |
| pct_not_speak_engl_well | 0.280 | 0.148 | 0.214 |
| race_pct_hisp | 0.265 | 0.175 | 0.220 |
| pct_larg_house_fam | 0.341 | 0.163 | 0.252 |
| pct_hous_occup | -0.256 | -0.304 | 0.280 |
| pct_b_sor_more | -0.299 | -0.271 | 0.285 |
| rent_median | -0.230 | -0.341 | 0.285 |
| pct_vacant_boarded | 0.475 | 0.324 | 0.399 |
| pct_not_hs_grad | 0.467 | 0.367 | 0.417 |
| med_income | -0.396 | -0.466 | 0.431 |
| pct_unemployed | 0.475 | 0.392 | 0.434 |
| pct_hous_own_occ | -0.461 | -0.462 | 0.461 |
| pct_pop_under_pov | 0.499 | 0.510 | 0.505 |
| racepctblack | 0.624 | 0.474 | 0.549 |
| race_pct_white | -0.676 | -0.477 | 0.576 |
| family_risk_index | 0.743 | 0.678 | 0.710 |
threshold <- 0.15
vars_below_threshold <- variable_ranking %>%
filter(mean_impact < threshold)
drop_names <- vars_below_threshold %>% pull(variable)
datos_model <- datos_clean2 %>%
dplyr::select(-all_of(drop_names))
# Results
cat("Variables removed in this step:", length(drop_names), "\n")Variables removed in this step: 8
cat("Final Predictors:", ncol(datos_model) - 2, "\n") # Excluding the 2 targetsFinal Predictors: 19
# Display the final list of predictors
names(datos_model)[!names(datos_model) %in% c("violent_crimes_per_pop", "non_viol_per_pop")] [1] "state" "population"
[3] "racepctblack" "race_pct_white"
[5] "race_pct_hisp" "med_income"
[7] "pct_pop_under_pov" "pct_not_hs_grad"
[9] "pct_b_sor_more" "pct_unemployed"
[11] "pct_immig_recent" "pct_not_speak_engl_well"
[13] "pct_larg_house_fam" "pct_hous_occup"
[15] "pct_hous_own_occ" "pct_vacant_boarded"
[17] "rent_median" "pop_dens"
[19] "family_risk_index"
2.4 Data Scaling
For this project, the decision not to standardize the predictors (not to transform them to mean 0 and standard deviation 1). This choice is based on the following technical and analytical criteria:
Interpretability of Coefficients: In a criminological context, it is highly valuable to interpret the \(\beta\) coefficients in their original units. For example, a coefficient for
pct_unemployedallows us to state the expected change in crime rate for every 1 percentage point increase in unemployment. Standardizing would transform these into standard deviation units, which are less intuitive for policy-making.Comparable Ranges: The majority of variables do not exhibit massive magnitude differences, as the socio-economic indicators are mostly bounded between 0 and 100 or 0 and 1.
Numerical Stability: Ordinary Least Squares is scale-invariant regarding its predictive power and \(R^2\). Given that the condition number of our predictor matrix is managed (through the previous removal of highly collinear variables with \(|r| > 0.85\)), the lack of scaling will not lead to computational instability. Also, final multicollinearity will be formally assessed using the Variance Inflation Factor (VIF) once the specific models are fitted, ensuring that no redundant information compromises the coefficient estimates.
No Penalization Applied: Since this project focuses on Linear Regression and GLMs without \(L_1\) or \(L_2\) regularization (like LASSO or Ridge), the fairness issue of penalizing variables with different scales is not applicable here.
The dataset is now ready for exploratory data analysis and modeling. All problematic variables have been removed, missing target observations eliminated, and the single predictor missing value imputed with a robust method. It is time to start analyzing the dataset:
3 Explorative Data Analysis
Exploratory Data Analysis (EDA) is the foundational step of any predictive modeling project. It is a stage of methodological rigor aimed at understanding the structure of the data and validating if it meets the statistical assumptions required by the models we intend to use (in this case, Linear Regression).
The EDA is not merely a visualization exercise; it is a critical stage that demonstrates critical thinking by diagnosing problems that, if uncorrected, would invalidate our model’s conclusions and predictions.
3.1 Target Variables Analysis
We begin the EDA by focusing on the most crucial element for the model: the distribution and characteristics of our two target variables, violent crime rate and non-violent crime rate per 100,000 inhabitants.
# Calculate summary statistics for both target variables
target_stats <- datos_model %>%
dplyr::select(violent_crimes_per_pop, non_viol_per_pop) %>%
summarise(
Mean = map_dbl(., mean, na.rm = TRUE),
Median = map_dbl(., median, na.rm = TRUE),
SD = map_dbl(., sd, na.rm = TRUE),
Min = map_dbl(., min, na.rm = TRUE),
Max = map_dbl(., max, na.rm = TRUE),
Q1 = map_dbl(., quantile, 0.25, na.rm = TRUE),
Q3 = map_dbl(., quantile, 0.75, na.rm = TRUE),
Skewness = map_dbl(., moments::skewness, na.rm = TRUE)
) %>%
# Reshape the table for clear comparison
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Target = rep(c("violent_crimes_per_pop", "non_viol_per_pop"), each = 8)) %>%
pivot_wider(names_from = Target, values_from = Value)
target_stats %>%
kable(caption = "Descriptive Statistics of Target Variables", digits = 2)| Statistic | violent_crimes_per_pop | non_viol_per_pop |
|---|---|---|
| Mean | 583.69 | 4942.32 |
| Median | 369.32 | 4479.66 |
| SD | 608.27 | 2786.61 |
| Min | 6.64 | 116.79 |
| Max | 4877.06 | 27119.76 |
| Q1 | 164.24 | 2913.28 |
| Q3 | 792.69 | 6271.72 |
| Skewness | 2.10 | 1.59 |
Both variables are right-shifted, as we can see both in the Skewness coefficient and in the following histograms and box plots:
# Violent Crime Rate
p1_strict <- datos_model %>%
ggplot(aes(x = violent_crimes_per_pop)) +
geom_histogram(bins = 50, fill = "#E74C3C", alpha = 0.8, color = "white", linewidth = 0.2) +
labs(x = "Violent Crimes per Pop", y = "Count") +
theme_classic(base_size = 12)
# Non Violent Crime Rate
p2_strict <- datos_model %>%
ggplot(aes(x = non_viol_per_pop)) +
geom_histogram(bins = 50, fill = "#3498DB", alpha = 0.8, color = "white", linewidth = 0.2) +
labs(x = "Non-Violent Crimes per Pop", y = "Count") +
theme_classic(base_size = 12)
p1_strict + p2_strict +
plot_annotation(title = "Distribution of Crime Rates",
theme = theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 16)))# Boxplots to identify outliers
datos_box <- datos_model %>%
dplyr::select(violent_crimes_per_pop, non_viol_per_pop) %>%
tidyr::pivot_longer(everything(), names_to = "Crime_Type", values_to = "Rate") %>%
mutate(Crime_Type = factor(Crime_Type,
levels = c("violent_crimes_per_pop", "non_viol_per_pop"),
labels = c("Violent Crime Rate", "Non-Violent Crime Rate"))) # Etiquetas legibles
p_combined_box <- datos_box %>%
ggplot(aes(x = Crime_Type, y = Rate, fill = Crime_Type)) +
geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 1.5) +
# Ahora los nombres coinciden exactamente con los labels del factor anterior
scale_fill_manual(values = c("Violent Crime Rate" = "#E74C3C",
"Non-Violent Crime Rate" = "#3498DB")) +
labs(
title = "Distribution and Outliers of Crime Rates",
x = "",
y = "Rate per 100K Population"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
legend.position = "none")
p_combined_boxThe plots reveal that both crime rates are strongly right-skewed, with non-violent crimes showing significantly higher magnitude and variance than violent crimes. The boxplots confirm the presence of numerous extreme outliers in both categories. Thus, a transformation may be necessary to satisfy the normality assumption of the linear regression model.
3.1.1 Relationship Between Target Variables
Before selecting a modeling strategy, it is crucial to examine the relationship between the target variables. Understanding whether these variables are independent or highly correlated helps determine if they share common underlying drivers. This scatter plot visually assesses their linearity and distribution.
# Scatter plot: violent vs non-violent crime
datos_model %>%
ggplot(aes(x = violent_crimes_per_pop, y = non_viol_per_pop)) +
geom_point(alpha = 0.5, color = "darkblue") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
labs(title = "Relationship Between Violent and Non-Violent Crime",
x = "violent_crimes_per_pop",
y = "non_viol_per_pop") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))# Correlation between targets
cor_targets <- cor(datos_model$violent_crimes_per_pop,
datos_model$non_viol_per_pop)
cat("Correlation between violent and non-violent crime:",
round(cor_targets, 3), "\n")Correlation between violent and non-violent crime: 0.675
Some insights are extracted from the plot:
Positive Correlation: As expected, there is a strong positive linear relationship. Communities with high rates of violent crime almost invariably experience high rates of non-violent crime, suggesting they are symptoms of similar socioeconomic factors.
Heteroscedasticity: As crime rates increase, the variance in the data grows significantly. This indicates that the model’s error will likely not be constant across all values.
Data Skewness: The density of points in the bottom-left corner confirms that the dataset is right-skewed; most communities have relatively low crime rates, while a few hotspots drive the upper range of the scale.
3.1.2 Target Variables Transformations
As it could be seen in the prior section Target Variables Analysis, none of the dependent variables follow a normal distribution.
Linear Regression relies on the assumption that the residuals are normally distributed. In criminological data, it is common to find highly skewed distributions (many safe neighborhoods, few extremely dangerous ones). We analyze the target variables to determine if a transformation is necessary.
3.1.2.1 Violent Crimes Variable
Let’s start with the Violent Crime Variable Analysis. First, we will do the logarithmic plot and then we will study how accurate this transformation is.
# Variable Distribution
p1 <- datos_model %>%
ggplot(aes(x = violent_crimes_per_pop)) +
geom_histogram(bins = 50, fill = "red", alpha = 0.7, color = "white") +
labs(title = "Target Distribution", x = "Violent Crimes", y = "Count") + theme_classic() + theme(plot.title = element_text(hjust = 0.5))
# QQ Plot
p3 <- ggplot(datos_model, aes(sample = violent_crimes_per_pop)) +
stat_qq() + stat_qq_line() + labs(title = "QQ-Plot") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
(p1 + p3) # Log-Transformed Target Variable Distribution (+1, to avoid taking the logarithm of 0)
p2 <- datos_model %>%
ggplot(aes(x = log(violent_crimes_per_pop +1))) +
geom_histogram(bins = 50, fill = "blue", alpha = 0.7, color = "white") +
labs(title = "Log-Transformed Target Distribution", x = "Log(Violent Crimes)", y = "Count") +
theme_classic()
# QQ Plot
p4 <- ggplot(datos_model, aes(sample = log(violent_crimes_per_pop + 1))) +
stat_qq() + stat_qq_line() + labs(title = "QQ-Plot") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
(p2 + p4)Visually, the log transformation appears effective. To validate this mathematically, we use the Box-Cox transformation method to find the optimal power parameter (\(\lambda\)) that maximizes normality.
First, we visually inspect the distribution of Violent Crime and use the Box-Cox method to find the optimal power transformation (\(\lambda\)). - \(\lambda \approx 1\): No transformation. - \(\lambda \approx 0\): Log transformation. - \(\lambda \approx 0.5\): Sqrt transformation.
Explicit approximations can be made for each value of \(\lambda\), but sometimes it’s more interesting to be a little bit less precise and mantain as much the interpretation of the model.
Remark: Given a value \(\lambda\), the associated Box-Cox transformation is:
\[y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(y) & \text{if} \lambda = 0 \end{cases}\] But in some cases, the transformation used will be directly \(Y^{\lambda}\); as the substraction and the division are just linear transformations of the “normalized” variable.
# Box-Cox for Violent Crime
bc_viol <- boxcox(lm(violent_crimes_per_pop + 1 ~ 1, data = datos_model),
lambda = seq(-1, 1, 0.001), plotit = TRUE)
title(main = "Violent Crime λ", line = 2)# Find exact Lambda values
lambda_viol <- bc_viol$x[which.max(bc_viol$y)]
cat("Optimal Lambda:", lambda_viol, "\n")Optimal Lambda: 0.115
Since 0.115 is very close to 0, a Log-Transformation could be the most effective method to normalize the data. But let’s try the Box-Cox Transformation with \(\lambda = 0.115\):
# Box-Cox Transformation for Violent Crimes
lambda_viol <- 0.115
p_hist_box_viol <- datos_model %>%
ggplot(aes(x = ((violent_crimes_per_pop)^lambda_viol - 1) / lambda_viol)) +
geom_histogram(bins = 50, fill = "purple", alpha = 0.7, color = "white") +
labs(title = paste("Box-Cox Distribution (λ =", lambda_viol, ")"),
x = "Box-Cox Transformed (Violent)", y = "Count") +
theme_classic() + theme(plot.title = element_text(hjust = 0.5))
p_qq_box_viol <- ggplot(datos_model, aes(sample = ((violent_crimes_per_pop)^lambda_viol - 1) / lambda_viol)) +
stat_qq() + stat_qq_line() +
labs(title = paste("QQ-Plot (λ =", lambda_viol, ")")) +
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
# Plot combined
(p_hist_box_viol + p_qq_box_viol)As observed in the comparison, the difference between the Log-Transformation (\(\lambda=0\)) and the optimal Box-Cox transformation (\(\lambda=0.115\)) is visually subtle, which is expected given the proximity of \(\lambda\) to zero. However, to ensure the highest degree of precision and methodological consistency with the previous variable analysis, we will proceed with the Box-Cox transformed variable (\(\lambda=0.115\)).
While we will continue exploring the relationships between variables using the original scale for interpretability in the rest of the EDA, we will define a new transformed variable viol_trans to be used specifically in the Linear Regression model to satisfy the normality assumption.
3.1.2.2 Non-Violent Crimes Variable
In this case, let’s start by calculating \(\lambda\), and then we will do the appropiate transformation.
# Box-Cox for Non-Violent Crime
bc_nonviol <- boxcox(lm(non_viol_per_pop + 1 ~ 1, data = datos_model),
lambda = seq(-1, 1, 0.001), plotit = TRUE)
title(main = "Non-Violent Crime λ", line = 2)# Find exact Lambda values
lambda_nonviol <- bc_nonviol$x[which.max(bc_nonviol$y)]
cat("Optimal Lambda:", lambda_nonviol, "\n")Optimal Lambda: 0.246
In this case, 0.246 is closer to 0 than to 0.5 still, but the difference is no that small. So we will consider three transformations (\(\lambda = 0\), \(\lambda = 0.5\), \(\lambda = 0.246\)) and then we will choose the most convenient for our model.
Original Data
No transformation: \(Y\)
# Original Data Distribution & QQ Plot
p_hist_orig <- datos_model %>%
ggplot(aes(x = non_viol_per_pop)) +
geom_histogram(bins = 50, fill = "red", alpha = 0.7, color = "white") +
labs(title = "Original Distribution", x = "Non-Violent Crimes", y = "Count") +
theme_classic() + theme(plot.title = element_text(hjust = 0.5))
p_qq_orig <- ggplot(datos_model, aes(sample = non_viol_per_pop)) +
stat_qq() + stat_qq_line() +
labs(title = "QQ-Plot (Original)") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
# Plot combined
(p_hist_orig + p_qq_orig)We can confirm that this target variable is also right-skewed, and deviated from normality. So it needs a transformation.
Logarithmic Transformation We apply the logarithm to compress the right tail. A constant was added (+1) to avoid the logarithm of 0.
Transformation: \(log(Y+1)\)
# Log-Transformed Distribution & QQ Plot
p_hist_log <- datos_model %>%
ggplot(aes(x = log(non_viol_per_pop + 1))) +
geom_histogram(bins = 50, fill = "blue", alpha = 0.7, color = "white") +
labs(title = "Log Distribution (λ = 0)", x = "Log(Non-Violent Crimes)", y = "Count") +
theme_classic() + theme(plot.title = element_text(hjust = 0.5))
p_qq_log <- ggplot(datos_model, aes(sample = log(non_viol_per_pop + 1))) +
stat_qq() + stat_qq_line() +
labs(title = "QQ-Plot (Log)") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
# Plot combined
(p_hist_log + p_qq_log)Square Root Transformation We try a transformation less agressive than the logarithm, useful for data whose variance grows with the mean.
Transformation: \(\sqrt{Y}\)
# Square Root Transformed Distribution & QQ Plot
p_hist_sqrt <- datos_model %>%
ggplot(aes(x = sqrt(non_viol_per_pop))) +
geom_histogram(bins = 50, fill = "green", alpha = 0.7, color = "white") +
labs(title = "Sqrt Distribution (λ = 0.5)", x = "Sqrt(Non-Violent Crimes)", y = "Count") +
theme_classic() + theme(plot.title = element_text(hjust = 0.5))
p_qq_sqrt <- ggplot(datos_model, aes(sample = sqrt(non_viol_per_pop))) +
stat_qq() + stat_qq_line() +
labs(title = "QQ-Plot (Sqrt)") +
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
# Plot combined
(p_hist_sqrt + p_qq_sqrt)Optimal Box-Cox Transformation We apply the logarithm to compress the right tail. A constant was added (+1) to avoid the logarithm of 0.
Transformation: \[\frac{Y^{0.246} - 1}{0.246}\]
# Box-Cox Optimal Distribution & QQ Plot
lambda <- 0.246
p_hist_box <- datos_model %>%
ggplot(aes(x = ((non_viol_per_pop)^lambda - 1) / lambda)) +
geom_histogram(bins = 50, fill = "purple", alpha = 0.7, color = "white") +
labs(title = paste("Box-Cox Distribution (λ =", lambda, ")"),
x = "Box-Cox Transformed", y = "Count") +
theme_classic() + theme(plot.title = element_text(hjust = 0.5))
p_qq_box <- ggplot(datos_model, aes(sample = ((non_viol_per_pop + 1)^lambda - 1) / lambda)) +
stat_qq() + stat_qq_line() +
labs(title = paste("QQ-Plot (λ =", lambda, ")")) +
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
# Plot combined
(p_hist_box + p_qq_box)Clearly, this is the best transformation, so we will apply this transformation to the variable non_viol_per_pop.
While we will continue exploring the relationships between variables using the original scale for interpretability in the rest of the EDA, we will define a new transformed variable non_viol_trans to be used specifically in the Linear Regression model to satisfy the normality assumption.
3.2 Geographic Distribution
Before proceeding with modeling, we aim to investigate if there is a strong spatial component to the crime rates. Specifically, we want to visualize the relationship between the state and both violent and non-violent crime intensities.
state_crime_summary <- datos_model %>%
group_by(state) %>%
summarise(
avg_violent = mean(violent_crimes_per_pop, na.rm = TRUE),
avg_nonviolent = mean(non_viol_per_pop, na.rm = TRUE)
) %>%
mutate(state = toupper(str_trim(state))) # Standardize to "FL", "NY", etc.# Top 10 States by Violent Crime Average
p1 <- state_crime_summary %>%
top_n(10, avg_violent) %>%
ggplot(aes(x = reorder(state, avg_violent), y = avg_violent)) +
geom_col(fill = "darkred", alpha = 0.8) +
coord_flip() +
labs(title = "Top Violent Crime States", x = "", y = "Rate") +
theme_minimal()
# Top 10 States by Non-Violent Crime Average
p2 <- state_crime_summary %>%
top_n(10, avg_nonviolent) %>%
ggplot(aes(x = reorder(state, avg_nonviolent), y = avg_nonviolent)) +
geom_col(fill = "steelblue", alpha = 0.8) +
coord_flip() +
labs(title = "Top Non-Violent Crime States", x = "", y = "Rate") +
theme_minimal()
# Display plots
p1 + p2state_crime_summary$state <- str_trim(state_crime_summary$state)
state_crime_summary$state <- str_to_title(state_crime_summary$state)
plot_usmap(data = state_crime_summary, values = "avg_violent", regions = "states") +
scale_fill_continuous(
low = "#FFD700",
high = "#8B0000",
name = "Viol Crime Rate",
label = scales::comma
) +
labs(title = "Geographic Distribution of Violent Crime",
subtitle = "Intensity of violent crimes per 100K inhabitants") +
theme(legend.position = "right")plot_usmap(data = state_crime_summary, values = "avg_nonviolent", regions = "states") +
scale_fill_continuous(
low = "#E0FFFF",
high = "#000080",
name = "Non-Violent Rate",
label = scales::comma
) +
labs(title = "Geographic Distribution of Non-Violent Crime",
subtitle = "Intensity of non-violent crimes per 100K inhabitants") +
theme(legend.position = "right")The maps above reveal noticeable gaps (gray areas). To understand the extent of this issue, we inspect the data format and identify exactly which states are absent from the dataset:
# Check format of the state column
head(state_crime_summary$state, 10) [1] "Ak" "Al" "Ar" "Az" "Ca" "Co" "Ct" "Dc" "De" "Fl"
# Identify missing states by comparing with the official US abbreviation list
state_crime_summary$state <- toupper(state_crime_summary$state)
missing_states <- setdiff(state.abb, state_crime_summary$state)
cat("States completely missing from the dataset:\n",missing_states)States completely missing from the dataset:
HI IL KS MI MT NE VT
Feature Selection: Elimination of the variable
Based on the analysis above, the variable stateis going to be excluded from the predictive modeling features. This decision is based on two main reasons:
Data Completeness & Bias: As shown in the diagnostics, several states are completely missing from the dataset. Using state as a predictor would introduce a significant bias, as the model would be unable to make predictions for these regions or would have to impute them blindly.
Ethical Modeling: Relying on location labels can inadvertently act as a proxy for demographic profiling (similar to redlining). We want the model to learn from the explicit socioeconomic factors (such as income or education), not simply where a community is located.
# Dropping the state variable from the dataset
datos_model <- datos_model %>%
dplyr::select(-state)3.3 Predictor Variables Overview
We examine the distribution of key predictor variables across different categories: demographic, socioeconomic, housing, and family structure. Understanding these distributions helps us identify potential issues (skewness, outliers, multimodality) that may affect modeling.
3.3.1 Key Demographic Variables
Demographic composition fundamentally shapes community dynamics and crime patterns through mechanisms of social organization, opportunity structures, and cultural norms.
3.3.1.1 Population Density
datos_model %>%
ggplot(aes(x = pop_dens)) +
geom_histogram(bins = 50,
fill = viridis(1, option = "C", alpha = 0.85),
color = "white", linewidth = 0.2) +
labs(
title = "Distribution of Population Density",
x = "Population Density (people per sq mile)",
y = "Count"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))Population density exhibits extreme right-skewness, with most communities having low to moderate density but a few urban centers reaching very high density. This severe skewness suggests that a transformation will need to be considered for linear modeling. High-density urban areas present unique crime dynamics (anonymity, opportunity, social disorganization) distinct from rural communities.
3.3.1.2 Racial Composition
datos_model %>%
dplyr::select(
White = race_pct_white,
Black = racepctblack,
Hispanic = race_pct_hisp,
) %>%
pivot_longer(everything(), names_to = "race", values_to = "percentage") %>%
mutate(race = factor(race, levels = c("White", "Black", "Hispanic"))) %>%
ggplot(aes(x = percentage, fill = race)) +
geom_histogram(bins = 30, color = "white", linewidth = 0.2) +
facet_wrap(~race, scales = "free_y", nrow =1, ncol = 3) +
scale_fill_viridis_d(option = "D", alpha = 0.85) +
labs(
title = "Distribution of Racial Composition",
x = "Percentage of Population",
y = "Count"
) +
theme_classic(base_size = 14) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
legend.position = "none",
strip.background = element_rect(fill = "grey90", color = "white")
)Racial composition varies dramatically across communities, reflecting historical settlement patterns, migration flows, and segregation. White percentage shows left-skewness (most communities are majority-white), while Black and Hispanic percentages show right-skewness (concentrated in certain communities). This heterogeneity is important for understanding structural inequality and its relationship with crime.
crime_race_data <- datos_model %>%
dplyr::select(violent_crimes_per_pop, non_viol_per_pop,
race_pct_white, racepctblack, race_pct_hisp) %>%
pivot_longer(cols = c(race_pct_white, racepctblack, race_pct_hisp),
names_to = "race_group",
values_to = "percentage") %>%
mutate(race_group = factor(race_group,
levels = c("race_pct_white", "racepctblack", "race_pct_hisp"),
labels = c("White", "Black", "Hispanic")))
# Violent Crime vs. Race
p_viol <- crime_race_data %>%
ggplot(aes(x = percentage, y = violent_crimes_per_pop, color = race_group)) +
geom_point(alpha = 0.1, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 0.6) +
facet_wrap(~race_group, scales = "free_x", nrow = 1) +
scale_color_viridis_d(option = "turbo") +
labs(title = "Violent Crime vs. Racial Composition",
y = "Violent Rate (per 100K)", x = "") +
theme_bw() +
theme(legend.position = "none", plot.title = element_text(face = "bold", size = 11))
p_viol # Non-Violent Crime vs. Race
p_nonviol <- crime_race_data %>%
ggplot(aes(x = percentage, y = non_viol_per_pop, color = race_group)) +
geom_point(alpha = 0.1, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 0.6) +
facet_wrap(~race_group, scales = "free_x", nrow = 1) +
scale_color_viridis_d(option = "turbo") +
labs(title = "Non-Violent Crime vs. Racial Composition",
y = "Non-Violent Rate (per 100K)", x = "Percentage of Population") +
theme_bw() +
theme(legend.position = "none", plot.title = element_text(face = "bold", size = 11))
p_nonviolAnalyzing these plots, high percentages of minority populations are not randomly distributed; they are concentrated in a small number of specific communities (the “tails” of the histograms). This suggests that in our model, these variables may act as proxies for specific urban centers or regions rather than just demographic indicators.
3.3.2 Key Socioeconomic Variables
Economic conditions are among the strongest predictors of crime, operating through strain (economic desperation), opportunity (employment alternatives), and community investment (social organization).
3.3.2.1 Median Income
datos_model %>%
ggplot(aes(x = med_income)) +
geom_histogram(bins = 50, fill = "#00204D", color = "white") + # Azul oscuro cividis
labs(title = "Median Income Distribution", x = "Income ($)", y = "Count") +
theme_classic() + theme(plot.title = element_text(face = "bold"))Median income shows moderate right-skewness with substantial variation across communities. The distribution reveals economic stratification, with most communities clustering around moderate income levels but a long tail of wealthy communities. Income is theoretically linked to crime through economic strain (poverty increases motivation) and community resources (wealth enables crime prevention).
3.3.2.2 Poverty Rate
datos_model %>%
ggplot(aes(x = pct_pop_under_pov)) +
geom_histogram(bins = 50, fill = "#7C7B78", color = "white") + # Gris cividis
labs(title = "Poverty Rate Distribution", x = "Poverty Rate (%)", y = "Count") +
theme_classic() + theme(plot.title = element_text(face = "bold"))Poverty rates vary from near-zero to over 50%, capturing extreme economic disparities. High poverty is consistently linked to crime through strain theory (blocked legitimate opportunities), social disorganization (weakened community institutions), and concentrated disadvantage (cumulative effects of multiple deprivations).
3.3.2.3 Unemployment Rate
datos_model %>%
ggplot(aes(x = pct_unemployed)) +
geom_histogram(bins = 50, fill = "#FFEA46", color = "white") + # Amarillo cividis
labs(title = "Unemployment Rate Distribution", x = "Unemployment (%)", y = "Count") +
theme_classic() + theme(plot.title = element_text(face = "bold"))Unemployment shows an approximately normal distribution, though with right-skewness indicating some communities face severe joblessness. Unemployment relates to crime through reduced legitimate opportunities, economic strain on families, and idle time (routine activities theory).
3.3.2.4 Housing Decay
We examine the percentage of housing units that are vacant and boarded up. This is a critical indicator for the Broken Windows Theory, which posits that visible signs of physical deterioration signal a lack of social order, emboldening opportunistic criminals and drug activity.
datos_model %>%
ggplot(aes(x = pct_vacant_boarded)) +
geom_histogram(bins = 50, fill = "darkgreen", color = "white", linewidth = 0.2) +
labs(
title = "Distribution of Vacant/Boarded Housing",
x = "Percentage of Units Boarded Up",
y = "Count"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))The distribution is extremely right-skewed. The vast majority of communities have near-zero vacant houses, reflecting healthy neighborhoods. However, the outlier communities, where significant portions of housing are abandoned, represent areas of severe urban blight. In our modeling, these non-zero values are likely to be strong predictors of high-crime hotspots.
3.3.2.5 Immigration
We analyze the percentage of recent immigrants to test competing criminological theories. Social Disorganization Theory suggests that rapid demographic changes and language barriers might temporarily weaken community cohesion, while the Immigrant Paradox suggests that first-generation immigrants often exhibit lower crime rates than native-born citizens due to stronger family ties and work ethic.
datos_model %>%
ggplot(aes(x = pct_immig_recent)) +
geom_histogram(bins = 50, fill = "darkred", color = "white", linewidth = 0.2) + # Yellow/Green
labs(
title = "Distribution of Foreign Born Population",
x = "Percentage Foreign Born",
y = "Count"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))# Scatter Plot: Immigration vs Crime Rates
immig_crime_data <- datos_model %>%
dplyr::select(pct_immig_recent, violent_crimes_per_pop, non_viol_per_pop) %>%
pivot_longer(cols = c(violent_crimes_per_pop, non_viol_per_pop),
names_to = "crime_type",
values_to = "rate") %>%
mutate(crime_type = factor(crime_type,
levels = c("violent_crimes_per_pop", "non_viol_per_pop"),
labels = c("Violent Crime", "Non-Violent Crime")))
p2 <- immig_crime_data %>%
ggplot(aes(x = pct_immig_recent, y = rate, color = crime_type)) +
geom_point(alpha = 0.5, size = 1) +
geom_smooth(method = "lm", color = "black", linewidth = 0.8) +
facet_wrap(~crime_type, scales = "free_y") +
scale_color_viridis_d(option = "D", begin = 0.3, end = 0.8) +
labs(
title = "Correlation: Immigration vs. Crime",
x = "Percentage Foreign Born",
y = "Crime Rate (per 100K)"
) +
theme_bw() +
theme(plot.title = element_text(face = "bold"), legend.position = "none")
p2Both crime types show positive correlations with recent immigration. However, the extreme variance is more telling: communities with identical immigration rates exhibit crime levels differing by 10-15 fold.
This suggests immigration is a marker of urban areas (where both immigration and crime concentrate) rather than a direct cause. The relationship likely reflects confounding; immigrants settle in high-density, lower-income neighborhoods that already have elevated crime rates due to structural disadvantage.
3.3.3 Key Family Structure Variables
Family structure reflects informal social control: the capacity of families and communities to supervise youth and maintain order without formal policing. We examine not just the prevalence of two-parent families, but the interplay between family instability and economic strain.
3.3.3.1 Family Risk Index
Rather than examining individual family metrics separately, we use a pre-constructed composite family_risk_index that captures overall household instability through divorce rates, single-parent prevalence, and non-marital births.
# Distribution of the index
datos_model %>%
ggplot(aes(x = family_risk_index)) +
geom_histogram(bins = 50,
fill = viridis(1, option = "magma", alpha = 0.85),
color = "white", linewidth = 0.2) +
labs(
title = "Family Risk Index Distribution",
x = "Family Risk Index",
y = "Count"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))# Relationship with crime
family_crime <- datos_model %>%
dplyr::select(family_risk_index, violent_crimes_per_pop, non_viol_per_pop) %>%
filter(!is.na(family_risk_index)) %>%
pivot_longer(cols = c(violent_crimes_per_pop, non_viol_per_pop),
names_to = "crime_type",
values_to = "rate") %>%
mutate(crime_type = factor(crime_type,
levels = c("violent_crimes_per_pop", "non_viol_per_pop"),
labels = c("Violent Crime", "Non-Violent Crime")))
family_crime %>%
ggplot(aes(x = family_risk_index, y = rate, color = crime_type)) +
geom_point(alpha = 0.4, size = 1) +
geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 0.5) +
facet_wrap(~crime_type, scales = "free_y") +
scale_color_viridis_d(option = "C", begin = 0.2, end = 0.8) +
labs(
title = "Family Risk as Crime Predictor",
x = "Family Risk Index",
y = "Crime Rate per 100K"
) +
theme_bw() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
legend.position = "none"
)3.4 Correlation Analysis
After examining individual distributions, we now determine which specific factors have the strongest statistical relationship with crime. This ranking helps prioritize feature selection for our predictive models.
3.4.1 Violent Crimes
In this section, we calculate the Pearson correlation of all predictors against violent_crimes_per_pop and display the strongest relationships (both positive and negative).
# Calculate correlations of all numeric variables
cor_rankings <- datos_model %>%
dplyr::select(where(is.numeric)) %>%
dplyr::select(-non_viol_per_pop) %>%
cor(use = "pairwise.complete.obs") %>%
as.data.frame() %>%
dplyr::select(correlation = violent_crimes_per_pop) %>%
mutate(variable = rownames(.)) %>%
filter(variable != "violent_crimes_per_pop") %>%
arrange(desc(abs(correlation))) %>%
head(10)
# Lollipop Plot
cor_rankings %>%
ggplot(aes(x = reorder(variable, correlation), y = correlation)) +
geom_segment(aes(x = reorder(variable, correlation), xend = variable, y = 0, yend = correlation), color = "grey") +
geom_point(aes(color = correlation > 0), size = 4) +
coord_flip() + # Flip coordinates for readability
scale_color_manual(values = c("steelblue", "darkred"), labels = c("Negative", "Positive")) +
labs(
title = "Top Predictors of Violent Crime",
subtitle = "Pearson Correlation Coefficients",
x = "",
y = "Correlation Strength",
color = "Direction"
) +
theme_bw(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "bottom"
)The top predictors for Violent Crime exhibit very strong correlations, with coefficients exceeding absolute values of 0.7. The ranking is heavily dominated by our engineered feature, family_risk_index, which confirms that combining family instability metrics into a single index successfully captured the predictive signal. Alongside this, demographic and economic indicators such as racepctblack and pct_pop_under_povremain critical, suggesting that violent crime in this dataset is strongly deterministic and deeply rooted in structural disadvantage.
3.4.2 Non-Violent Crimes
In this section, we calculate the Pearson correlation of all predictors against non_viol_per_pop and display the strongest relationships (both positive and negative).
# Calculate correlations of all numeric variables
cor_rankings <- datos_model %>%
dplyr::select(where(is.numeric)) %>%
dplyr::select(-violent_crimes_per_pop) %>%
cor(use = "pairwise.complete.obs") %>%
as.data.frame() %>%
dplyr::select(correlation = non_viol_per_pop ) %>%
mutate(variable = rownames(.)) %>%
filter(variable != "non_viol_per_pop") %>%
arrange(desc(abs(correlation))) %>%
head(10)
# Lollipop Plot
cor_rankings %>%
ggplot(aes(x = reorder(variable, correlation), y = correlation)) +
geom_segment(aes(x = reorder(variable, correlation), xend = variable, y = 0, yend = correlation), color = "grey") +
geom_point(aes(color = correlation > 0), size = 4) +
coord_flip() + # Flip coordinates for readability
scale_color_manual(values = c("steelblue", "darkred"), labels = c("Negative", "Positive")) +
labs(
title = "Top Predictors of Non-Violent Crime",
subtitle = "Pearson Correlation Coefficients",
x = "",
y = "Correlation Strength",
color = "Direction"
) +
theme_bw(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "bottom"
)A significant overlap exists between the drivers of violent and non-violent crime, with family_risk_index, pct_pop_under_pov, and med_income acting as universal predictors. This shared structure provides a strong foundation for our upcoming modeling phase. The high consistency across targets suggests that a Multivariate Linear Regression (predicting multiple outcomes simultaneously) could be highly effective, as the same core set of socioeconomic factors appears to drive overall criminality regardless of the specific offense type.
3.5 Handling of Extreme Values
While statistical tests flag certain observations as outliers, a qualitative review confirms that these data points represent valid community profiles rather than data entry errors. These communities exist in reality and often represent the most critical cases for our model to understand. Removing them would artificially truncate the natural variance of the data and introduce bias, potentially hindering the model’s ability to predict crime in high-risk areas where accurate forecasting is most needed.
4 Linear Regression
Linear regression is a fundamental statistical method for modeling the relationship between a dependent variable and one or more independent variables. The model assumes a linear relationship of the form:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon\]
Where:
- \(Y\): is the dependent variable (target)
- \(X_1, X_2, \dots, X_p\): are independent variables (predictors)
- \(\beta_0\): is the intercept
- \(\beta_1, \beta_2, \dots, \beta_p\): are coefficients
- \(\epsilon\): is the error term
For our analysis, we will build separate models for violent and non-violent crime rates, as they represent distinct but related outcomes.
4.1 Data preparation
4.1.1 Target Variable Transformation
As identified in our EDA, both target variables require transformation to meet the normality assumption of linear regression. We apply Box-Cox transformations with the optimal \(\lambda\) values:
#This was the final dataset that we were working on:
datos_model <- datos_clean2 %>%
dplyr::select(-all_of(drop_names)) %>%
dplyr::select(-any_of("state"))# Apply Box-Cox transformations to target variables
datos_model <- datos_model %>%
mutate(
# Box-Cox transformation for violent crimes
viol_trans = ((violent_crimes_per_pop + 1)^0.115 - 1) / 0.115,
# Box-Cox transformation for non-violent crimes
nonviol_trans = ((non_viol_per_pop + 1)^0.246 - 1) / 0.246
)
# Let's verify the transformations are effective
cat("Original violent crime rate - Skewness:",
moments::skewness(datos_model$violent_crimes_per_pop, na.rm = TRUE), "\n")Original violent crime rate - Skewness: 2.102536
cat("Transformed violent crime - Skewness:",
moments::skewness(datos_model$viol_trans, na.rm = TRUE), "\n\n")Transformed violent crime - Skewness: -0.01841899
cat("Original non-violent crime rate - Skewness:",
moments::skewness(datos_model$non_viol_per_pop, na.rm = TRUE), "\n")Original non-violent crime rate - Skewness: 1.58542
cat("Transformed non-violent crime - Skewness:",
moments::skewness(datos_model$nonviol_trans, na.rm = TRUE), "\n")Transformed non-violent crime - Skewness: 0.007598315
# We eliminate this variables from the dataset to avoid any possible problems in the future
datos_model <- datos_model %>%
dplyr::select(-any_of(c("violent_crimes_per_pop", "non_viol_per_pop")))
cat("Transformed variables created and original variables removed.\n")Transformed variables created and original variables removed.
cat("Dataset dimensions:", nrow(datos_model), "rows,", ncol(datos_model), "columns\n\n")Dataset dimensions: 1902 rows, 20 columns
We also create an inverse Cox-Box transformation that will be used in the future:
inverse_boxcox <- function(y_transformed, lambda) {
if (lambda == 0) {
# Original transformation: log(y + 1)
# Inverse: exp(y_transformed) - 1
y_original <- exp(y_transformed) - 1
} else {
# Original transformation: [(y + 1)^lambda - 1] / lambda
# Inverse: [(lambda * y_transformed + 1)^(1/lambda)] - 1
y_original <- ((lambda * y_transformed + 1)^(1/lambda)) - 1
}
y_original <- pmax(y_original, 0)
return(y_original)
}4.1.2 Training Split
Rather than a single train-test split, we employ k-fold cross-validation to obtain more robust performance estimates. This approach repeatedly partitions the data into k subsets, using each subset once as a test set while training on the remaining k-1 subsets. This reduces variance in our performance metrics and provides a more reliable assessment of model generalization. We use k=10 folds, which offers a good balance between computational efficiency and bias-variance tradeoff for our dataset size (\(n = 1902\)).
# Set seed for reproducibility
set.seed(2025)
# Create k-fold splits
k_folds <- 10
# Create fold assignments
datos_model <- datos_model %>%
mutate(fold_id = sample(rep(1:k_folds, length.out = n())))
cat("Number of folds:", k_folds, "\n")Number of folds: 10
cat("Approximate observations per fold:", round(nrow(datos_model)/k_folds), "\n\n")Approximate observations per fold: 190
Also, a custom Cross-Validation function will be defined, in order to directly run the complete pipeline and compute some key performance metrics:
cross_validate_model <- function(formula, data, target_name, model_name, folds = 10) {
set.seed(2025)
metrics_list <- list()
for (fold in 1:folds) {
# Split data
test_indices <- which(data$fold_id == fold)
train_data <- data[-test_indices, ]
test_data <- data[test_indices, ]
# Fit model
model <- lm(formula, data = train_data)
# Make predictions
predictions <- predict(model, newdata = test_data)
# Get actual values
actual <- test_data[[target_name]]
# Calculate metrics (with NA handling)
mse <- mean((actual - predictions)^2, na.rm = TRUE)
rmse <- sqrt(mse)
mae <- mean(abs(actual - predictions), na.rm = TRUE)
# R^2 calculation
ss_res <- sum((actual - predictions)^2, na.rm = TRUE)
ss_tot <- sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
r2 <- 1 - (ss_res / ss_tot)
# Store metrics
metrics_list[[fold]] <- data.frame(
Fold = fold,
Model = model_name,
Target = target_name,
RMSE = rmse,
MAE = mae,
R2 = r2
)
}
# Combine results
results_df <- bind_rows(metrics_list)
# Calculate summary statistics
summary_df <- results_df %>%
summarise(
Model = first(Model),
Target = first(Target),
Mean_RMSE = mean(RMSE),
SD_RMSE = sd(RMSE),
Mean_MAE = mean(MAE),
SD_MAE = sd(MAE),
Mean_R2 = mean(R2),
SD_R2 = sd(R2)
)
return(list(detailed = results_df, summary = summary_df))
}4.1.2.1 Why Cross-Validation for RL?
In predictive modeling, evaluating a model solely on the dataset used for estimation leads to biased and overly optimistic performance metrics. In-sample measures such as \(R^2\) tend to underestimate the true generalization error, particularly when multiple model specifications are explored. This issue is closely related to the bias–variance trade-off: models that fit the training data very well may capture noise rather than underlying structure, resulting in poor out-of-sample performance.
To obtain a more reliable estimate of predictive accuracy, we evaluate all linear regression models using K-fold cross-validation. This resampling approach approximates the expected out-of-sample error by repeatedly training the model on \(K-1\) folds and validating it on the remaining fold, averaging the results across all partitions. Cross-validation reduces both estimation variance and model selection bias, making it especially suitable for heuristic model comparison. Consequently, model choice in this study is based on cross-validated metrics (RMSE, MAE, and \(R^2\)), ensuring that selected models balance explanatory power with robust generalization to unseen communities.
All this information was obtained from an article, that has been linked and mentioned in the References Section.
4.2 Model Development and Evaluation Metrics
For each of the target variables, different and variate models will be developed and compared . The following criteria will be used ranging from simple to complex:
-Model V1: Single best predictor
-Model V2: Top 3 correlated predictors
-Model V3: Heuristic/Parsimonious model (theory driven)
-Model V4: Interaction model
-Model V5: Full model (all 18 predictors)
-Model V6: PCA-Based model (dimensionality reduction)
The goal is to balance explanatory power with parsimony to avoid overfitting while maintaining interpretability.
One of the most important parts of the project consists on the explanation of the criteria used for choosing a model over the others. In this case, two primary metrics will be considered to evaluate the model performance:
- Coefficient of Determination: \(R^2\)
\(R^2\) measures the proportion of variance in the target variable explained by the model. The higher \(R^2\), the better fitted is the model. It ranges from 0 to 1.
\[R^2 = 1 - \frac{SS_{residual}}{SS_{total}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\] \(SS_{residual}\) (Residual Sum of Squares): Also known as \(RSS\). It measures the sum of the squared differences between the actual observations and the model’s predictions. It represents the “unexplained” variation.
\(SS_{total}\) (Total Sum of Squares): Also known as \(TSS\). It measures the total variation in the observed data relative to the mean. It is proportional to the variance of the sample.
\(y_i\): The actual observed value for observation \(i\).
\(\hat{y}_i\): The predicted value for observation \(i\) generated by the regression model.
\(\bar{y}\): The arithmetic mean of all observed values of the target variable \(y\).
The \(R^2\) serves as a primary measure of goodness of fit. A value closer to 1 indicates that the model accounts for a large portion of the variance, suggesting a strong fit to the data. However, in our predictive context, a high \(R^2\) on the training data does not always guarantee high performance on unseen data.
- Root Mean Squared Error (RMSE)
To assess the average magnitude of the error in our predictions, we use the Root Mean Squared Error (RMSE). This metric is defined as:
\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]
\(n\): The total number of observations in the sample.
\(y_i\): The actual observed value for community \(i\).
\(\hat{y}_i\): The predicted value for community \(i\).
The RMSE provides a clear measure of how much, on average, the model’s predictions deviate from the actual values. A lower RMSE indicates a more precise model with higher predictive accuracy. The specific utility of RMSE lies in the squaring of the residuals before they are averaged. This mathematical property means that the metric is highly sensitive to large errors. In a criminological context, this is a critical advantage for handling extreme communites and for managing risk.
In some cases, we will also use a third metric, called Mean Absolute Error (MAE), that measures the average magnitude of the errors in a set of predictions, without considering their direction. It is a linear score which means that all individual differences are weighted equally in the average. It is defined as:
\[MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\] where \(n\), \(y_i\) and \(\hat{y}_i\) have the same meaning as in RMSE. It is an interesting option because it is a linear metric, and it is also more robust to outliers than RMSE.
Then,, we select the best model by balancing:
High CV \(R^2\): Strong explanatory power that generalizes to new data
Low CV RMSE: Small prediction errors
Parsimony: Fewer predictors are preferred when performance differences are marginal
Interpretability: Models with clear theoretical grounding and policy implications
4.3 Violent Crime Regression Models
First, let’s start with the models associated to the prediction and interpretability of the variable viol_trans.
4.3.1 Model Comparison
In this section, all the different models will be trained and analyzed briefly, but not compared. The comparison will take place in the next section.
4.3.1.1 Model V1: Single Best Predictor
We begin with the simplest possible model, using only family_risk_index, which showed the strongest correlation with violent crime in our EDA.
# Define formula
formula_v1 <- viol_trans ~ family_risk_index
# Cross-validate for performance assessment
cv_v1 <- cross_validate_model(formula_v1, datos_model, "viol_trans", "V1: Single", k_folds)
# Results
cat(" Mean R^2:", round(cv_v1$summary$Mean_R2, 3),
"±", round(cv_v1$summary$SD_R2, 3), "\n Mean RMSE:", round(cv_v1$summary$Mean_RMSE, 3),
"±", round(cv_v1$summary$SD_RMSE, 3), "\n\n") Mean R^2: 0.588 ± 0.043
Mean RMSE: 1.345 ± 0.092
This baseline confirms that family structure alone captures substantial crime variance, validating its theoretical importance. However, a single-predictor model likely underfits the complexity of violent crime determinants.
4.3.1.4 Model V4: Interaction Effects
Before building this model, we hypothesize that the effect of family instability on violent crime is amplified in poor communities. Mathematically, this means that the impact of family_risk_index is not constant—it depends on the level of pct_pop_under_pov.
In regression terms: \[\text{Crime} = \beta_0 + \beta_1 \cdot \text{Family Risk} + \beta_2 \cdot \text{Poverty} + \beta_3 \cdot (\text{Family Risk} \times \text{Poverty}) + \epsilon\]
If \(\beta_3 \neq 0\), the relationship is non-additive. This aligns with cumulative disadvantage theory: communities experiencing both family breakdown and economic deprivation suffer disproportionately higher crime than the sum of individual effects would predict.
formula_v4 <- viol_trans ~ family_risk_index + pct_pop_under_pov +
racepctblack + pct_vacant_boarded +
family_risk_index:pct_pop_under_pov
cv_v4 <- cross_validate_model(formula_v4, datos_model, "viol_trans", "V4: Interaction", k_folds)
cat(" Mean R^2:", round(cv_v4$summary$Mean_R2, 3),
"±", round(cv_v4$summary$SD_R2, 3), "\n Mean RMSE:", round(cv_v4$summary$Mean_RMSE, 3),
"±", round(cv_v4$summary$SD_RMSE, 3), "\n\n") Mean R^2: 0.598 ± 0.055
Mean RMSE: 1.327 ± 0.104
Tests whether the relationship between family structure and crime is context-dependent on economic conditions.
4.3.1.5 Model V5: Full Model (All Predictors)
formula_v5 <- viol_trans ~ family_risk_index + racepctblack +
pct_pop_under_pov + pct_vacant_boarded + pct_unemployed +
pct_not_hs_grad + pct_larg_house_fam + med_income +
pct_hous_own_occ + race_pct_white + pct_immig_recent +
population + pop_dens + pct_not_speak_engl_well +
race_pct_hisp + pct_hous_occup + pct_b_sor_more + rent_median
cv_v5 <- cross_validate_model(formula_v5, datos_model, "viol_trans", "V5: Full", k_folds)
cat(" Mean R^2:", round(cv_v5$summary$Mean_R2, 3),
"±", round(cv_v5$summary$SD_R2, 3), "\n Mean RMSE:", round(cv_v5$summary$Mean_RMSE, 3),
"±", round(cv_v5$summary$SD_RMSE, 3), "\n\n") Mean R^2: 0.639 ± 0.047
Mean RMSE: 1.258 ± 0.097
Maximum complexity benchmark. Establishes the performance ceiling but likely overfits and sacrifices interpretability.
4.3.1.6 Model V6: PCA-Based Model
# Prepare predictor matrix
predictors_pca <- datos_model %>%
dplyr::select(-viol_trans, -nonviol_trans, -fold_id) %>%
dplyr::select(where(is.numeric))
# Perform PCA
pca_result <- PCA(predictors_pca, scale.unit = TRUE, ncp = 10, graph = FALSE)
# Variance explained
variance_table <- data.frame(
PC = 1:10,
Variance_Pct = round(pca_result$eig[1:10, "percentage of variance"], 2),
Cumulative_Pct = round(cumsum(pca_result$eig[1:10, "percentage of variance"]), 2)
)
kable(variance_table, caption = "PCA Variance Decomposition")| PC | Variance_Pct | Cumulative_Pct | |
|---|---|---|---|
| comp 1 | 1 | 38.26 | 38.26 |
| comp 2 | 2 | 15.55 | 53.81 |
| comp 3 | 3 | 10.25 | 64.06 |
| comp 4 | 4 | 7.39 | 71.45 |
| comp 5 | 5 | 5.57 | 77.02 |
| comp 6 | 6 | 5.11 | 82.13 |
| comp 7 | 7 | 4.52 | 86.64 |
| comp 8 | 8 | 3.35 | 89.99 |
| comp 9 | 9 | 2.40 | 92.39 |
| comp 10 | 10 | 1.81 | 94.21 |
# Select components
n_components <- which(variance_table$Cumulative_Pct >= 80)[1]
cat("\nComponents needed for 80% variance:", n_components, "\n\n")
Components needed for 80% variance: 6
# Extract PC scores
pc_scores <- as.data.frame(pca_result$ind$coord[, 1:n_components])
colnames(pc_scores) <- paste0("PC", 1:n_components)
# Create dataset with PCs
datos_pca <- cbind(
viol_trans = datos_model$viol_trans,
pc_scores,
fold_id = datos_model$fold_id
)
# Build formula and cross-validate
formula_v6 <- as.formula(paste("viol_trans ~", paste(colnames(pc_scores), collapse = " + ")))
cv_v6 <- cross_validate_model(formula_v6, datos_pca, "viol_trans", "V6: PCA", k_folds)
cat(" Mean R^2:", round(cv_v6$summary$Mean_R2, 3),
"±", round(cv_v6$summary$SD_R2, 3), "\n Mean RMSE:", round(cv_v6$summary$Mean_RMSE, 3),
"±", round(cv_v6$summary$SD_RMSE, 3), "\n\n") Mean R^2: 0.541 ± 0.083
Mean RMSE: 1.415 ± 0.118
Demonstrates dimensionality reduction as a methodological alternative. However, principal components are uninterpretable linear combinations, making this unsuitable for policy guidance.
4.3.2 Violent Crime Model Comparison
Let’s choose the best model for predicting our target variable now:
# Consolidate CV summaries
comparison_violent <- bind_rows(
cv_v1$summary %>% mutate(N_Predictors = 1),
cv_v2$summary %>% mutate(N_Predictors = 3),
cv_v3$summary %>% mutate(N_Predictors = 7),
cv_v4$summary %>% mutate(N_Predictors = 5),
cv_v5$summary %>% mutate(N_Predictors = 18),
cv_v6$summary %>% mutate(N_Predictors = n_components)
) %>%
dplyr::select(Model, N_Predictors, Mean_R2, SD_R2, Mean_RMSE, SD_RMSE) %>%
arrange(desc(Mean_R2))
kable(comparison_violent, digits = 4,
caption = "Violent Crime: Cross-Validation Performance Comparison")| Model | N_Predictors | Mean_R2 | SD_R2 | Mean_RMSE | SD_RMSE |
|---|---|---|---|---|---|
| V5: Full | 18 | 0.6391 | 0.0472 | 1.2579 | 0.0968 |
| V4: Interaction | 5 | 0.5981 | 0.0545 | 1.3274 | 0.1045 |
| V3: Theory | 7 | 0.5912 | 0.0422 | 1.3398 | 0.0854 |
| V1: Single | 1 | 0.5881 | 0.0426 | 1.3452 | 0.0920 |
| V2: Top 3 | 3 | 0.5877 | 0.0424 | 1.3459 | 0.0926 |
| V6: PCA | 6 | 0.5411 | 0.0835 | 1.4146 | 0.1179 |
4.3.2.1 Visual Comparison of Models
# Create comparison plots
p1 <- ggplot(comparison_violent, aes(x = reorder(Model, Mean_R2), y = Mean_R2)) +
geom_col(fill = "green", alpha = 1) +
geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2),
width = 0.2) +
coord_flip() +
labs(title = "Model Comparison: R²", x = "", y = "Mean R² (± SD)") +
theme_classic()
p2 <- ggplot(comparison_violent, aes(x = reorder(Model, -Mean_RMSE), y = Mean_RMSE)) +
geom_col(fill = "darkred", alpha = 1) +
geom_errorbar(aes(ymin = Mean_RMSE - SD_RMSE, ymax = Mean_RMSE + SD_RMSE),
width = 0.2) +
coord_flip() +
labs(title = "Model Comparison: RMSE", x = "", y = "Mean RMSE (± SD)") +
theme_classic()
p1 + p2 + plot_annotation(title = "Cross-Validation Performance: All Models")4.3.2.2 Best Model Selection
We will not consider Model V5 (Full Model) even if it is the model with the highest \(R^2\) metric, because it is the most complex and less interpretable model out of the 6. We are also not interested in model V6 (PCA), as it is uninterpretable (and the worst in terms of \(R^2\)!). Also, Model V1 will be eliminated as it is too simple (it is the baseline model).
Surprisingly, Model V2 (Top 3 Predictors) does not improve upon the Baseline (V1), maintaining an \(R^2\) of 0.588 and even slightly increasing the RMSE. This suggests a high degree of redundancy among the predictors. The family_risk_index likely acts as a proxy for both economic strain (pct_pop_under_pov) and structural disadvantage, capturing the maximum explainable variance on its own. Adding the other two variables introduces complexity without providing marginal information, which leads to a negligible decrease in predictive stability during cross-validation.
Finally, from the models left, the model V4 (Interactions) has the highest Mean \(R^2\), while having a controlled Standard Deviation of the metric, and it is not very complex (only 5 variables), so this will be the final model for the Violent Crimes’ Variable.
The selected model formula is:
print(formula_v4)viol_trans ~ family_risk_index + pct_pop_under_pov + racepctblack +
pct_vacant_boarded + family_risk_index:pct_pop_under_pov
Now, let’s fit all the data to our selected model:
best_model_v <- lm(formula_v4, data = datos_model)4.3.2.3 Interaction Term Significance Test
To justify the selection of Model V4, we formally test whether the interaction between family instability and poverty is statistically significant. A significant interaction indicates that these two social forces do not act independently but instead amplify or moderate each other.
coef_matrix <- summary(best_model_v)$coefficients
# Identify the interaction term (automatically finds the row with ':')
int_row_name <- grep(":", rownames(coef_matrix), value = TRUE)
interaction_stats <- coef_matrix[int_row_name, ]
cat("Term:", int_row_name, "\n")Term: family_risk_index:pct_pop_under_pov
cat("Estimate:", round(interaction_stats["Estimate"], 5), "\n")Estimate: -0.00524
cat("t-value:", round(interaction_stats["t value"], 3), "\n")t-value: -7.682
cat("p-value:", format(interaction_stats["Pr(>|t|)"], scientific = TRUE), "\n\n")p-value: 2.493357e-14
The result is clear: a p-value of \(2.49 \times 10^{-14}\) is far below any standard significance level, confirming that the interaction effect is not just a statistical fluke but a core component of how these community factors operate. The negative estimate (-0.00524) tells us that while both poverty and family risk increase crime, their combined “boost” is slightly smaller than expected if they were acting independently.
4.3.3 Model Diagnostic
Once the winning model has been selected based on metrics, we must verify that the results are not misleading. This diagnostic phase confirms that the residuals (the part the model fails to explain) behave as random white noise. If we find patterns in the error, it means we are missing critical information or that the linear approach is not suitable for this data. We will now test the core assumptions to validate the coefficients and the predictions that follow.
4.3.3.1 Linearity
Linearity assumes that the relationship between the independent variables and the transformed target follows a straight-line pattern. Verifying this assumption is essential because Ordinary Least Squares (OLS) regression estimates constant rates of change through its coefficients; if the true underlying relationship is non-linear or curved, the model will fail to capture the structural dynamics of the data. This would result in systematic prediction errors and biased estimates, rendering the model’s conclusions unreliable for community-level analysis.
plot(best_model_v, which = 1,
main = "Residuals vs Fitted",
sub.caption = "") In the plot, we look for a random distribution of points around the horizontal zero line. Our results show no visible “U-shape” or systematic curves, and the smoothed red line stays relatively flat and close to the dashed baseline. This indicates that the linear specification is appropriate for the dataset and confirms that the initial Box-Cox transformation successfully linearized the relationship between our socio-economic predictors and the crime rates, allowing the model to make consistent predictions across different community profiles.
4.3.3.2 Normality of Residuals
The normality assumption is required to ensure that our p-values and confidence intervals are accurate. If the residuals are not normally distributed, the statistical significance of our predictors might be unreliable. We use two main tools to evaluate this: a Histogram to check the overall shape and a Normal Q-Q Plot to see how the residuals align with the theoretical quantiles of a normal distribution. Regarding the formal test (Shapiro-Wilk), we include it for completeness, but it must be interpreted with caution. With a large sample size (\(N \approx 1900\)), formal tests are extremely sensitive and will almost always return a \(p < 0.05\) even for minor, practically irrelevant deviations. Therefore, we prioritize the Q-Q Plot. As long as the points follow the diagonal line for the majority of the range, we consider the model robust enough for predictive analysis.
par(mfrow = c(1, 2))
plot(best_model_v, which = 2)
hist(residuals(best_model_v), breaks = 50, main = "Residual Distribution",
xlab = "Residuals", col = "lightblue", border = "white")par(mfrow = c(1, 1))
# Shapiro-Wilk Test
residuals_v <- residuals(best_model_v)
shapiro_test_v <- shapiro.test(residuals_v)Before relying on the formal test, we look at the visual evidence. The Histogram shows a symmetric, bell-shaped distribution of errors centered at zero. In the Normal Q-Q Plot, the majority of the observations follow the diagonal reference line quite closely, especially in the central range. While there are slight deviations at the “tails” (the very high and very low crime areas), this is common in social science data where extreme communities often exhibit unique variance.
Now, let’s also do the Shapiro-Wilk Normality Test:
cat(" W-statistic:", round(shapiro_test_v$statistic, 4), "\n") W-statistic: 0.9934
cat(" p-value:", format(shapiro_test_v$p.value, scientific = TRUE), "\n\n") p-value: 1.683885e-07
Although the \(p\)-value is below 0.05, which technically rejects the null hypothesis of perfect normality, we must consider our sample size (\(N \approx 1900\)). In large datasets, the Shapiro-Wilk test becomes extremely sensitive to even the smallest, practically insignificant deviations.The W-statistic of 0.9934 is actually very close to 1, indicating that the residuals are 99.3% similar to a normal distribution. For predictive modeling and large-scale inference, this level of normality is more than sufficient. The Box-Cox transformation successfully addressed the heavy skewness seen in the raw data, and we can move forward knowing our standard errors and intervals are robust.
4.3.3.3 Homoscedasticity
Homoscedasticity means that the variance of the residuals is constant across all levels of the predicted values. If the variance changes (heteroscedasticity), the model’s predictive reliability fluctuates; for example, it might be very accurate for low-crime communities but highly erratic for high-crime ones. Ensuring constant variance is critical for the validity of the prediction intervals we will calculate later.
# Visualizing variance stability
plot(best_model_v, which = 3)The Scale-Location plot allows us to check this spread. We are looking for a horizontal red line with points scattered randomly around it, without forming a “funnel” or “fan” shape. In our model, the red line remains relatively flat, and the points show a consistent vertical spread across the range of fitted values. This suggests the variance is stable, but we will now apply mathematical rigor to confirm this observation.
Now, regarding the formal test:
bp_test_v <- bptest(best_model_v)
cat(" Breusch-Pagan Test statistic:", round(bp_test_v$statistic, 4), "\n") Breusch-Pagan Test statistic: 9.459
cat(" p-value:", format(bp_test_v$p.value, scientific = TRUE), "\n\n") p-value: 9.210001e-02
Since the p-value (0.0921) is greater than 0.05, we fail to reject the null hypothesis of homoscedasticity. This confirms that the model’s error variance is stable. This result is particularly positive for social science data, where heteroscedasticity is common. It ensures that our predictions will be equally reliable across the entire range of community crime rates.
4.3.3.4 Independence
The independence assumption requires that the residuals are not correlated with each other. In a cross-sectional dataset like Communities and Crime, violating this assumption (often through spatial autocorrelation) would mean that the error in one community could help predict the error in another. This would lead to underestimated standard errors and overconfident p-values. We use the Durbin-Watson test to ensure that no such systematic correlation exists in our model’s errors.
dw_test_v <- dwtest(best_model_v)
cat(" Durbin-Watson Test statistic:", round(dw_test_v$statistic, 4), "\n") Durbin-Watson Test statistic: 2.1236
cat(" p-value:", format(dw_test_v$p.value, scientific = TRUE), "\n\n") p-value: 9.965002e-01
The Durbin-Watson statistic is 2.1236, which is nearly perfect as it sits very close to the ideal value of 2.0 (representing zero autocorrelation). With a p-value of 0.9965, we clearly fail to reject the null hypothesis of independence.
This result confirms that the residuals behave as random noise and are not influenced by the order or spatial grouping of the observations. This adds another layer of validity to our model, ensuring that our statistical inferences regarding the predictors are reliable.
4.3.3.5 Multicollinearity
Multicollinearity occurs when independent variables are highly correlated, which can inflate the variance of the coefficient estimates and make them unstable. We use the Variance Inflation Factor (VIF) to detect this. Generally, a VIF above 10 is a warning sign, but this rule changes when the model includes interaction terms.
vif_values_v <- vif(best_model_v)
cat("Variance Inflation Factors (VIF):\n")Variance Inflation Factors (VIF):
vif_table <- data.frame(
Variable = names(vif_values_v),
VIF = round(vif_values_v, 2)
)
kable(vif_table, row.names = FALSE)| Variable | VIF |
|---|---|
| family_risk_index | 5.06 |
| pct_pop_under_pov | 10.16 |
| racepctblack | 2.75 |
| pct_vacant_boarded | 1.47 |
| family_risk_index:pct_pop_under_pov | 19.62 |
max_vif_v <- max(vif_values_v)
cat("\nMaximum VIF:", round(max_vif_v, 2), "\n\n")
Maximum VIF: 19.62
The table shows a maximum VIF of 19.62 for the interaction term and 10.16 for poverty. While these numbers exceed the traditional threshold of 10, this is a case of structural multicollinearity. Because the interaction term is a mathematical product of family_risk_index and pct_pop_under_pov, a high correlation between them is inevitable.
In this context, high VIF values do not bias the model or invalidate the \(R^2\). They simply reflect the mathematical dependency created by our interaction effect. Since the model’s primary goal is to capture how poverty amplifies family risk, we have decided to keep these variables to preserve the sociological integrity of the analysis. The model remains stable for prediction, and the coefficients reflect the complex reality of social disadvantage.
4.3.3.6 Influential Observations
Identifying influential observations is the final step in our diagnostic process. A data point is considered “influential” if its exclusion significantly alters the model’s estimated coefficients. We use Cook’s Distance and the Residuals vs. Leverage plot to detect these points, which help us understand if the model is being overly driven by a small subset of extreme cases.
par(mfrow = c(1, 2))
plot(best_model_v, which = 4)
plot(best_model_v, which = 5)par(mfrow = c(1, 1))In the Residuals vs Leverage plot, we specifically look for points that fall beyond the dashed red lines (Cook’s distance thresholds). These are observations that combine high leverage (extreme values in the predictors) with high residuals (poor fit by the model).
Regarding the mathematical threshold:
cooks_d_v <- cooks.distance(best_model_v)
influential_v <- which(cooks_d_v > 4/nrow(datos_model))
cat("\nInfluential observations (Cook's D > 4/n):", length(influential_v), "\n")
Influential observations (Cook's D > 4/n): 80
Our analysis identifies 80 influential observations based on the \(4/n\) rule. However, a qualitative review confirms that these points are not the result of data entry errors or “noise.” Instead, they represent actual communities with extreme socioeconomic profiles and high crime rates.
From a predictive and policy-making perspective, these are the most critical observations for the model to capture. Removing them would lead to a “sanitized” model that performs well on average but fails exactly where it is needed most: in high-risk areas. Therefore, we have decided to retain all influential observations to ensure the model remains representative of the full complexity of urban crime dynamics.
In conclusion, the diagnostic results confirm that the model is statistically sound. The core OLS assumptions are met, ensuring that both the coefficient estimates and prediction intervals remain reliable. While structural multicollinearity and minor deviations in normality exist, the model provides a robust framework for analyzing the socioeconomic drivers of violent crime.
4.3.4 Predictive Performance
To conclude our analysis of violent crime, we evaluate how the selected model (V4: Interaction) performs on the entire dataset. We compare these results with our previous 10-fold cross-validation estimates to detect any potential overfitting.
4.3.4.1 Performance on Full Dataset
# Generate predictions
datos_model$pred_viol_trans <- predict(best_model_v, newdata = datos_model)
# Calculate metrics
rmse_full <- sqrt(mean((datos_model$viol_trans - datos_model$pred_viol_trans)^2, na.rm = TRUE))
r2_full <- cor(datos_model$viol_trans, datos_model$pred_viol_trans, use = "complete.obs")^2
#Performance in the full dataset:
cat(" R^2:", round(r2_full, 4), "\n") R^2: 0.6049
cat(" RMSE:", round(rmse_full, 4), "\n\n") RMSE: 1.3248
comparison_metrics <- data.frame(
Metric = c("R-Squared (R²)", "RMSE"),
Cross_Validation = c(round(cv_v4$summary$Mean_R2, 4), round(cv_v4$summary$Mean_RMSE, 4)),
Full_Dataset = c(round(r2_full, 4), round(rmse_full, 4)),
Difference = c(round(r2_full - cv_v4$summary$Mean_R2, 4), round(rmse_full - cv_v4$summary$Mean_RMSE, 4))
)
# Render table
knitr::kable(comparison_metrics,
col.names = c("Metric", "CV (Mean)", "Full Dataset", "Difference"),
caption = "Performance Consistency Check: CV vs. Full Data")| Metric | CV (Mean) | Full Dataset | Difference |
|---|---|---|---|
| R-Squared (R²) | 0.5981 | 0.6049 | 0.0068 |
| RMSE | 1.3274 | 1.3248 | -0.0026 |
The gap between the cross-validation results and the full dataset performance is minimal. A difference of less than 0.01 in \(R^2\) confirms that the model generalizes well and is not simply “memorizing” the noise in the training data. This stability gives us confidence that the chosen interaction model (V4) accurately captures the underlying socio-economic drivers of violent crime across different community subsets.
4.3.4.2 Understanding the metrics
In this section, we will obtain the equivalence of the metrics obtained in the transformed target for our original target.
We start by reobtaining the variable viol_crimes_per_popto our dataset:
# Reconstruct violent crime rate (original scale)
violent_crimes_per_pop <- inverse_boxcox(
datos_model$viol_trans,
lambda = 0.115
)And now, we inverse-transform the predictions made by the model, and calculate the metrics in our original “space”:
# Back-transform predictions to original scale
pred_violent_original <- inverse_boxcox(
datos_model$pred_viol_trans,
lambda = 0.115
)
# Calculate performance metrics on ORIGINAL scale
rmse_viol_original <- sqrt(mean(
(violent_crimes_per_pop - pred_violent_original)^2,
na.rm = TRUE
))
mae_viol_original <- mean(
abs(violent_crimes_per_pop - pred_violent_original),
na.rm = TRUE
)
r2_viol_original <- cor(
violent_crimes_per_pop,
pred_violent_original,
use = "complete.obs"
)^2
# Display metrics
cat(" RMSE:", round(rmse_viol_original, 2), "crimes per 100,000 inhabitants\n") RMSE: 415.73 crimes per 100,000 inhabitants
cat(" R^2: ", round(r2_viol_original, 4), "\n\n") R^2: 0.554
cat("On average, the model's predictions deviate by ±",
round(mae_viol_original, 1),
"violent crimes\nper 100,000 inhabitants from the actual values.\n\n")On average, the model's predictions deviate by ± 260.2 violent crimes
per 100,000 inhabitants from the actual values.
We can understand better now how our model works in the original dataset.
4.3.4.3 Prediction Intervals
While the average error is captured by RMSE, we need to understand the uncertainty of individual forecasts. The following table shows the 95% prediction intervals for a random sample of communities.
set.seed(2025)
sample_indices <- sample(1:nrow(datos_model), 10)
pred_intervals_v <- predict(best_model_v,
newdata = datos_model[sample_indices, ],
interval = "prediction",
level = 0.95)
pred_table_v <- data.frame(
Observation = sample_indices,
Predicted = round(pred_intervals_v[, "fit"], 3),
Lower_95 = round(pred_intervals_v[, "lwr"], 3),
Upper_95 = round(pred_intervals_v[, "upr"], 3),
Width = round(pred_intervals_v[, "upr"] - pred_intervals_v[, "lwr"], 3)
)
kable(pred_table_v, caption = "Sample Prediction Intervals (Transformed Scale)")| Observation | Predicted | Lower_95 | Upper_95 | Width | |
|---|---|---|---|---|---|
| 909 | 909 | 7.360 | 4.757 | 9.964 | 5.207 |
| 460 | 460 | 7.678 | 5.075 | 10.282 | 5.207 |
| 932 | 932 | 8.427 | 5.822 | 11.033 | 5.211 |
| 279 | 279 | 8.547 | 5.943 | 11.151 | 5.209 |
| 1534 | 1534 | 8.944 | 6.340 | 11.549 | 5.209 |
| 187 | 187 | 8.098 | 5.495 | 10.702 | 5.207 |
| 1290 | 1290 | 7.471 | 4.868 | 10.074 | 5.207 |
| 461 | 461 | 6.275 | 3.670 | 8.880 | 5.210 |
| 972 | 972 | 11.783 | 9.174 | 14.391 | 5.217 |
| 1563 | 1563 | 9.271 | 6.667 | 11.875 | 5.208 |
While the RMSE gives us a global measure of error, Prediction Intervals (PI) allow us to quantify the uncertainty for individual observations. Unlike a Confidence Interval—which estimates the mean response—the Prediction Interval accounts for both the model’s estimation error and the inherent random variation of a single community.
In the table above, we present the 95% PI for a random sample of 10 communities. For each community, the model provides a “Fit” (the point prediction) and a range defined by the “Lower” and “Upper” bounds. Mathematically, we expect that 95% of actual crime values will fall within these bounds.
The Width of these intervals is a direct indicator of the model’s precision. A narrow interval suggests that the community’s socio-economic profile leads to a highly reliable prediction, while a wider interval indicates that the community may have unique characteristics (outliers or high leverage) that increase the uncertainty of the forecast. This analysis is crucial for policy-making, as it identifies which community forecasts are reliable enough to guide resource allocation.
4.3.4.4 Visualization: Predicted vs Actual
The scatter plot below visualizes the model’s accuracy. In a perfect model, all points would lie on the dashed diagonal line.
ggplot(datos_model, aes(x = viol_trans, y = pred_viol_trans)) +
geom_point(alpha = 0.75, color = "darkred", size = 1.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed",
color = "black", linewidth = 1) +
labs(
title = "Violent Crimes",
subtitle = paste("Model:", "V4: Interaction"),
x = "Actual (Transformed Scale)",
y = "Predicted (Transformed Scale)"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))The plot shows a strong linear alignment, though some dispersion remains. This dispersion represents the community-specific factors (the residuals) that socio-economic variables alone cannot explain.
4.3.4.5 Coefficient Interpretation
The coefficients of our final model provide a detailed map of how socio-economic factors drive violent crime. Since the target variable is Box-Cox transformed (\(\lambda = 0.115\)), the magnitudes reflect changes on that transformed scale.
coef_summary_v <- summary(best_model_v)$coefficients
coef_table_v <- data.frame(
Variable = rownames(coef_summary_v),
Estimate = coef_summary_v[, "Estimate"],
Std_Error = coef_summary_v[, "Std. Error"],
t_value = coef_summary_v[, "t value"],
p_value = coef_summary_v[, "Pr(>|t|)"]
)
rownames(coef_table_v) <- NULL
kable(coef_table_v, digits = 4, caption = "Final Model Coefficients: Violent Crime")| Variable | Estimate | Std_Error | t_value | p_value |
|---|---|---|---|---|
| (Intercept) | 3.5778 | 0.1388 | 25.7779 | 0.0000 |
| family_risk_index | 0.3484 | 0.0122 | 28.6023 | 0.0000 |
| pct_pop_under_pov | 0.0722 | 0.0114 | 6.3075 | 0.0000 |
| racepctblack | 0.0077 | 0.0036 | 2.1342 | 0.0330 |
| pct_vacant_boarded | 0.0233 | 0.0107 | 2.1863 | 0.0289 |
| family_risk_index:pct_pop_under_pov | -0.0052 | 0.0007 | -7.6820 | 0.0000 |
Some insights and findings that can be obtained from the table:
Primary Driver:
Family Risk Index: This is the most powerful predictor in our model (Estimate: \(0.3484\), \(p < 0.001\)). Holding other factors constant, an increase in family instability (divorce, single-parent households, children born out of wedlock) has the strongest positive impact on violent crime rates. This underscores family structure as a critical sociological pillar in community safety.Economic Strain: Poverty (
pct_pop_under_pov) shows a strong positive relationship with crime (Estimate: \(0.0722\), \(p < 0.001\)). Communities with higher percentages of the population living below the poverty line are significantly more vulnerable to violent crime, confirming that economic deprivation is a fundamental catalyst.Interaction Effect: Family Risk × Poverty (Estimate: -0.0052, \(p < 0.05\)):The negative coefficient indicates a moderation effect, where the impact of family risk decreases as poverty increases. Mathematically:
\[\frac{\partial \text{Crime}}{\partial \text{Family Risk}} = 0.3484 - 0.0052 \times \text{Poverty}\]
In low-poverty communities (e.g., 5% poverty), the effect of family risk is strong: \(0.3484 - (0.0052 \times 5) = 0.322\).In high-poverty communities (e.g., 40% poverty), the effect is reduced: \(0.3484 - (0.0052 \times 40) = 0.140\). This suggests a “saturation effect.” In extremely deprived areas, poverty itself becomes the overwhelming driver of crime, making family structure a secondary (though still relevant) factor compared to its critical role in more affluent areas.
- Social and Physical Environment:
- Racial Composition:
racepctblackis statistically significant (\(p = 0.033\)), suggesting that even after controlling for poverty and family structure, communities with higher racial segregation (historically linked to systemic disinvestment) face higher violent crime risks. Urban Decay: The presence of abandoned buildings (pct_vacant_boarded) is a significant predictor (\(p = 0.028\)). This supports the “Broken Windows” theory or similar sociological perspectives, where the physical deterioration of a neighborhood acts as a signal that facilitates criminal activity.
- Racial Composition:
All predictors in our heuristic selection are statistically significant at the 95% confidence level (all \(p < 0.05\)). The model successfully moves beyond a simple linear addition of variables by identifying that the interplay between poverty and family structure is complex and non-linear, providing a more nuanced understanding of urban violence.
The analysis of violent crime has revealed a complex landscape where family instability and economic deprivation act as primary factors, with their relationship being non-additive as demonstrated by the significant interaction effect. However, criminological theory suggests that non-violent offenses—such as burglary, larceny, and auto theft—may follow different patterns, often being more sensitive to ‘opportunity’ factors, population density, and specific economic indicators like rental costs or median income.
In the following section, we replicate our modeling framework for Non-Violent Crimes (nonviol_trans). Our objective is to determine whether the structural drivers identified for violent crime remain consistent or if a different subset of predictors provides a more parsimonious and accurate explanation for property-related offenses.
4.4 Non-Violent Crime Regression Models
We now turn to the analysis of non-violent (property) crime, using the variable nonviol_trans. While violent and non-violent crimes share some common socioeconomic drivers, property crime is theorized to follow distinct causal mechanisms related to opportunity structures rather than purely social disorganization.
4.4.1 Model Comparison
In this section, we train and evaluate six candidate models for non-violent crime. Each model explores different levels of complexity and theoretical grounding. The final selection will be based on cross-validation performance, parsimony, and interpretability.
4.4.1.1 Model NV1: Single Best Predictor
We begin with the baseline model using only family_risk_index, which showed the strongest correlation with non-violent crime in our EDA.
formula_nv1 <- nonviol_trans ~ family_risk_index
cv_nv1 <- cross_validate_model(formula_nv1, datos_model, "nonviol_trans", "NV1: Single", k_folds)
cat(" Mean R^2:", round(cv_nv1$summary$Mean_R2, 3),
"±", round(cv_nv1$summary$SD_R2, 3), "\n") Mean R^2: 0.49 ± 0.062
cat(" Mean RMSE:", round(cv_nv1$summary$Mean_RMSE, 3),
"±", round(cv_nv1$summary$SD_RMSE, 3), "\n\n") Mean RMSE: 3.163 ± 0.217
This baseline establishes the performance floor. Family structure captures substantial variance even for property crime, though we expect economic and opportunity variables to add significant explanatory power.
4.4.1.3 Model NV3: Theory-Driven (Routine Activities Theory)
A parsimonious model grounded in Routine Activities Theory, which posits that property crime occurs when three elements converge: motivated offenders, suitable targets, and absence of guardianship. We include 7 predictors:
- Economic motivation:
med_income,pct_pop_under_pov,pct_unemployed - Target availability/density:
pop_dens - Guardianship:
pct_hous_own_occ(homeownership as proxy for community supervision) - Physical decay:
pct_vacant_boarded - Family structure:
family_risk_index
formula_nv3 <- nonviol_trans ~ med_income + pct_pop_under_pov +
pct_unemployed + pct_vacant_boarded + pop_dens +
pct_hous_own_occ + family_risk_index
cv_nv3 <- cross_validate_model(formula_nv3, datos_model, "nonviol_trans", "NV3: Theory", k_folds)
cat(" Mean R^2:", round(cv_nv3$summary$Mean_R2, 3),
"±", round(cv_nv3$summary$SD_R2, 3), "\n") Mean R^2: 0.503 ± 0.067
cat(" Mean RMSE:", round(cv_nv3$summary$Mean_RMSE, 3),
"±", round(cv_nv3$summary$SD_RMSE, 3), "\n\n") Mean RMSE: 3.122 ± 0.232
This model emphasizes opportunity and guardianship rather than social disorganization, reflecting the distinct criminological mechanisms underlying property crime.
4.4.1.4 Model NV4: Interaction Effects
We hypothesize that the effect of economic deprivation on property crime is amplified in high-density areas. Low income creates motivation, while high population density provides abundant targets and anonymity. Mathematically:
\[\text{Crime} = \beta_0 + \beta_1 \cdot \text{Income} + \beta_2 \cdot \text{Density} + \beta_3 \cdot (\text{Income} \times \text{Density}) + \epsilon\]
If \(\beta_3 \neq 0\), the relationship is synergistic: the combined effect of poverty and density exceeds the sum of their individual contributions. This aligns with the concentrated disadvantage perspective, where urban poverty creates disproportionate crime opportunities due to the increased proximity of targets and the potential breakdown of social control in crowded, low-income environments.
formula_nv4 <- nonviol_trans ~ med_income + pop_dens +
pct_pop_under_pov + family_risk_index +
med_income:pop_dens
cv_nv4 <- cross_validate_model(formula_nv4, datos_model, "nonviol_trans", "NV4: Interaction", k_folds)
cat(" Mean R^2:", round(cv_nv4$summary$Mean_R2, 3),
"±", round(cv_nv4$summary$SD_R2, 3), "\n") Mean R^2: 0.503 ± 0.064
cat(" Mean RMSE:", round(cv_nv4$summary$Mean_RMSE, 3),
"±", round(cv_nv4$summary$SD_RMSE, 3), "\n\n") Mean RMSE: 3.123 ± 0.236
Tests whether the relationship between income and property crime is context-dependent on population density.
4.4.1.5 Model NV5: Full Model (All Predictors)
formula_nv5 <- nonviol_trans ~ family_risk_index + racepctblack +
pct_pop_under_pov + pct_vacant_boarded + pct_unemployed +
pct_not_hs_grad + pct_larg_house_fam + med_income +
pct_hous_own_occ + race_pct_white + pct_immig_recent +
population + pop_dens + pct_not_speak_engl_well +
race_pct_hisp + pct_hous_occup + pct_b_sor_more + rent_median
cv_nv5 <- cross_validate_model(formula_nv5, datos_model, "nonviol_trans", "NV5: Full", k_folds)
cat(" Mean R^2:", round(cv_nv5$summary$Mean_R2, 3),
"±", round(cv_nv5$summary$SD_R2, 3), "\n") Mean R^2: 0.534 ± 0.069
cat(" Mean RMSE:", round(cv_nv5$summary$Mean_RMSE, 3),
"±", round(cv_nv5$summary$SD_RMSE, 3), "\n\n") Mean RMSE: 3.021 ± 0.23
Maximum complexity benchmark. Establishes the performance ceiling but likely overfits and sacrifices interpretability.
4.4.1.6 Model NV6: PCA-Based Model
As PCA is an unsupervised technique that operates on the predictors alone, the scores calculated for [Model V6] (#model-v6) remain valid for this section. Reusing these components addresses multicollinearity while providing a benchmark for predictive performance against our heuristic models.
# Use same PCA result from violent crime analysis
datos_pca_nv <- cbind(
nonviol_trans = datos_model$nonviol_trans,
pc_scores,
fold_id = datos_model$fold_id
)
formula_nv6 <- as.formula(paste("nonviol_trans ~", paste(colnames(pc_scores), collapse = " + ")))
cv_nv6 <- cross_validate_model(formula_nv6, datos_pca_nv, "nonviol_trans", "NV6: PCA", k_folds)
cat("Cross-Validation Results:\n")Cross-Validation Results:
cat(" Mean R^2:", round(cv_nv6$summary$Mean_R2, 3),
"±", round(cv_nv6$summary$SD_R2, 3), "\n") Mean R^2: 0.425 ± 0.073
cat(" Mean RMSE:", round(cv_nv6$summary$Mean_RMSE, 3),
"±", round(cv_nv6$summary$SD_RMSE, 3), "\n\n") Mean RMSE: 3.357 ± 0.234
Demonstrates dimensionality reduction but remains uninterpretable for policy guidance.
4.4.2 Non-Violent Crime Model Comparison
Let’s choose the best model for predicting our target variable now:
# Consolidate all CV results
comparison_nonviolent <- bind_rows(
cv_nv1$summary %>% mutate(N_Predictors = 1),
cv_nv2$summary %>% mutate(N_Predictors = 3),
cv_nv3$summary %>% mutate(N_Predictors = 7),
cv_nv4$summary %>% mutate(N_Predictors = 5),
cv_nv5$summary %>% mutate(N_Predictors = 18),
cv_nv6$summary %>% mutate(N_Predictors = n_components)
) %>%
dplyr::select(Model, N_Predictors, Mean_R2, SD_R2, Mean_RMSE, SD_RMSE) %>%
arrange(desc(Mean_R2))
kable(comparison_nonviolent, digits = 4,
caption = "Non-Violent Crime: Cross-Validation Performance Comparison")| Model | N_Predictors | Mean_R2 | SD_R2 | Mean_RMSE | SD_RMSE |
|---|---|---|---|---|---|
| NV5: Full | 18 | 0.5336 | 0.0689 | 3.0212 | 0.2297 |
| NV4: Interaction | 5 | 0.5029 | 0.0636 | 3.1231 | 0.2363 |
| NV3: Theory | 7 | 0.5028 | 0.0673 | 3.1216 | 0.2315 |
| NV2: Top 3 | 3 | 0.4941 | 0.0644 | 3.1497 | 0.2246 |
| NV1: Single | 1 | 0.4899 | 0.0618 | 3.1630 | 0.2169 |
| NV6: PCA | 6 | 0.4251 | 0.0727 | 3.3574 | 0.2339 |
4.4.2.1 Visual Comparison of Models
# Create comparison plots
p1_nv <- ggplot(comparison_nonviolent, aes(x = reorder(Model, Mean_R2), y = Mean_R2)) +
geom_col(fill = "green", alpha = 1) +
geom_errorbar(aes(ymin = Mean_R2 - SD_R2, ymax = Mean_R2 + SD_R2),
width = 0.2) +
coord_flip() +
labs(title = "Model Comparison: R²", x = "", y = "Mean R² (± SD)") +
theme_classic()
p2_nv <- ggplot(comparison_nonviolent, aes(x = reorder(Model, -Mean_RMSE), y = Mean_RMSE)) +
geom_col(fill = "darkred", alpha = 1) +
geom_errorbar(aes(ymin = Mean_RMSE - SD_RMSE, ymax = Mean_RMSE + SD_RMSE),
width = 0.2) +
coord_flip() +
labs(title = "Model Comparison: RMSE", x = "", y = "Mean RMSE (± SD)") +
theme_classic()
p1_nv + p2_nv + plot_annotation(title = "Cross-Validation Performance: All Models")4.4.2.2 Best Model Selection
Firsty, we discard Model NV1 beacuse it is the simplest and it was considered mainly to serve as a baseline.
Following the principle of parsimony, we discard Model NV5 (Full) due to its excessive complexity and Model NV6 (PCA) for its poor performance and lack of interpretability.While Model NV2 shows a slight improvement over the baseline (NV1), the gain is marginal, confirming that family_risk_index already captures the bulk of the explainable variance.
Between the remaining candidates, Model NV4 (Interaction) is the most efficient choice; it matches the performance of the more complex Model NV3 (Theory) with fewer predictors (\(R^2 \approx 0.503\)). Consequently, NV4 is selected as the final model for non-violent crimes, offering an optimal balance between accuracy and simplicity.
The selected model formula is:
print(formula_nv4)nonviol_trans ~ med_income + pop_dens + pct_pop_under_pov + family_risk_index +
med_income:pop_dens
cat("\n")Now, let’s fit all the data to our selected model:
best_model_nv <- lm(formula_nv4, data = datos_model)4.4.2.3 Interaction Term Significance Test
To justify the selection of Model V4, we formally test whether the interaction between family instability and poverty is statistically significant. A significant interaction indicates that these two social forces do not act independently but instead amplify or moderate each other.
coef_matrix <- summary(best_model_nv)$coefficients
int_row_name <- grep(":", rownames(coef_matrix), value = TRUE)
interaction_stats <- coef_matrix[int_row_name, ]
cat("Term:", int_row_name, "\n")Term: med_income:pop_dens
cat("Estimate:",interaction_stats["Estimate"], "\n")Estimate: 1.401305e-08
cat("t-value:", round(interaction_stats["t value"], 3), "\n")t-value: 5.231
cat("p-value:", format(interaction_stats["Pr(>|t|)"], scientific = TRUE), "\n\n")p-value: 1.875361e-07
The interaction term med_income:pop_dens yields a coefficient of \(1.40 \times 10^{-8}\). While this magnitude appears near zero, its extremely low p-value (\(1.87 \times 10^{-7}\)) and positive t-statistic (\(5.231\)) confirm it is a highly significant driver of non-violent crime.
For example, for med_income = $30000, pop_dens= 2000, the interaction effect is around 0.84; so the interaction is not zero in practice; but a powerful multiplier.
4.4.3 Model Diagnostic
We now verify that the selected model satisfies the core assumptions of linear regression.
4.4.3.1 Linearity
plot(best_model_nv, which = 1)The Residuals vs Fitted plot confirms that the linearity assumption is met. The residuals are randomly distributed around the horizontal zero line with no discernible “U-shape” or systematic curvature. The red trend line remains relatively flat, indicating that the model successfully captures the linear relationship between the predictors and the transformed target, despite minor influence from community outliers like 1625 and 379.
4.4.3.2 Normality of Residuals
We assess whether residuals follow an approximately normal distribution.
par(mfrow = c(1, 2))
plot(best_model_nv, which = 2)
hist(residuals(best_model_nv), breaks = 50, main = "Residual Distribution",
xlab = "Residuals", col = "lightblue", border = "white")par(mfrow = c(1, 1))
residuals_nv <- residuals(best_model_nv)
shapiro_test_nv <- shapiro.test(residuals_nv)
cat(" W-statistic:", round(shapiro_test_nv$statistic, 4), "\n") W-statistic: 0.9746
cat(" p-value:", format(shapiro_test_nv$p.value, scientific = TRUE), "\n\n") p-value: 7.185665e-18
The Shapiro-Wilk test yields a W-statistic of 0.9746 and a p-value of \(7.18 \times 10^{-18}\). While the low p-value technically rejects the null hypothesis of perfect normality, this is a common result in large datasets (\(N \approx 1900\)), where formal tests become hypersensitive to minor, practically irrelevant deviations. The W-statistic remains very close to 1, indicating a high degree of similarity to a normal distribution.When interpreted alongside the Q-Q Plot and the Residual Distribution histogram, it is evident that the errors are well-centered and follow the diagonal reference line for nearly all observations. The slight departures at the tails are expected in social science data and do not compromise the validity of the model’s coefficients or prediction intervals.
4.4.3.3 Homoscedasticity
We verify constant residual variance across fitted values.
plot(best_model_nv, which = 3)bp_test_nv <- bptest(best_model_nv)
cat("Breusch-Pagan Test statistic:", round(bp_test_nv$statistic, 4), "\n")Breusch-Pagan Test statistic: 6.6978
cat(" p-value:", format(bp_test_nv$p.value, scientific = TRUE), "\n\n") p-value: 2.441043e-01
The Scale-Location plot and the Breusch-Pagan test confirm that the homoscedasticity assumption is satisfied for the non-violent crime model. In the plot, the red trend line is nearly horizontal, and the standardized residuals display a consistent vertical spread across all fitted values, indicating that the error variance remains constant.
Statistically, the Breusch-Pagan test results in a p-value of 0.2441 (\(p > 0.05\)). Since we fail to reject the null hypothesis, we conclude that there is no significant evidence of heteroscedasticity. This ensures that the model’s standard errors and subsequent prediction intervals are reliable across the entire dataset.
4.4.3.4 Independence
We check for autocorrelation in residuals.
dw_test_nv <- dwtest(best_model_nv)
cat("Durbin-Watson Test statistic:", round(dw_test_nv$statistic, 4), "\n")Durbin-Watson Test statistic: 2.0159
cat(" p-value:", format(dw_test_nv$p.value, scientific = TRUE), "\n\n") p-value: 6.363979e-01
The Durbin-Watson test statistic of 2.0159 is nearly ideal, as a value of 2 indicates zero autocorrelation. The p-value (0.636) confirms that the independence assumption is satisfied. This ensures that the residuals are not systematically correlated and behave as random noise.
4.4.3.5 Multicollinearity
We assess correlation among predictors using VIF.
vif_values_nv <- vif(best_model_nv)
cat("Variance Inflation Factors:\n")Variance Inflation Factors:
vif_table_nv <- data.frame(
Variable = names(vif_values_nv),
VIF = round(vif_values_nv, 2)
)
kable(vif_table_nv, row.names = FALSE)| Variable | VIF |
|---|---|
| med_income | 3.65 |
| pop_dens | 13.11 |
| pct_pop_under_pov | 2.87 |
| family_risk_index | 2.45 |
| med_income:pop_dens | 13.88 |
cat("\nMaximum VIF:", round(max(vif_values_nv), 2), "\n\n")
Maximum VIF: 13.88
The maximum VIF of 13.88 for the interaction and 13.11 for population density exceeds the traditional threshold of 10. However, this represents structural multicollinearity, which is expected when including interaction terms as predictors are mathematically linked. Since this does not bias the coefficients or the \(R^2\), the variables are retained to preserve the model’s ability to analyze the complex relationship between income and urban density.
4.4.3.6 Influential Observations
We identify observations with disproportionate influence on coefficients.
par(mfrow = c(1, 2))
plot(best_model_nv, which = 4)
plot(best_model_nv, which = 5)par(mfrow = c(1, 1))cooks_d_nv <- cooks.distance(best_model_nv)
influential_nv <- which(cooks_d_nv > 4/nrow(datos_model))
cat("Influential observations (Cook's D > 4/n):", length(influential_nv), "\n\n")Influential observations (Cook's D > 4/n): 92
The diagnostic plots identify 92 influential observations exceeding the \(4/n\) threshold. These cases reflect real communities with extreme socioeconomic profiles rather than data entry errors. All observations are retained to preserve the model’s policy relevance and its ability to accurately analyze high-risk urban areas.
In conclusion, the diagnostic results confirm that the model is statistically sound. The core OLS assumptions are met, ensuring that both the coefficient estimates and prediction intervals remain reliable. While structural multicollinearity and minor deviations in normality exist, the model provides a robust framework for analyzing the socioeconomic drivers of non-violent crime.
4.4.4 Predictive Performance
To conclude our analysis of violent crime, we evaluate how the selected model (V4: Interaction) performs on the entire dataset. We compare these results with our previous 10-fold cross-validation estimates to detect any potential overfitting.
4.4.4.1 Performance on Full Dataset
# Generate predictions
datos_model$pred_nonviol_trans <- predict(best_model_nv, newdata = datos_model)
# Calculate metrics
rmse_full_nv <- sqrt(mean((datos_model$nonviol_trans - datos_model$pred_nonviol_trans)^2, na.rm = TRUE))
r2_full_nv <- cor(datos_model$nonviol_trans, datos_model$pred_nonviol_trans, use = "complete.obs")^2
#Performance in the full dataset:
cat(" R^2:", round(r2_full_nv, 4), "\n") R^2: 0.5093
cat(" RMSE:", round(rmse_full_nv, 4), "\n\n") RMSE: 3.1211
comparison_metrics <- data.frame(
Metric = c("R-Squared (R²)", "RMSE"),
Cross_Validation = c(round(cv_nv4$summary$Mean_R2, 4), round(cv_nv4$summary$Mean_RMSE, 4)),
Full_Dataset = c(round(r2_full_nv, 4), round(rmse_full_nv, 4)),
Difference = c(round(r2_full_nv - cv_nv4$summary$Mean_R2, 4), round(rmse_full_nv - cv_nv4$summary$Mean_RMSE, 4))
)
# Render table
knitr::kable(comparison_metrics,
col.names = c("Metric", "CV (Mean)", "Full Dataset", "Difference"),
caption = "Performance Consistency Check: CV vs. Full Data")| Metric | CV (Mean) | Full Dataset | Difference |
|---|---|---|---|
| R-Squared (R²) | 0.5029 | 0.5093 | 0.0064 |
| RMSE | 3.1231 | 3.1211 | -0.0020 |
The minimal gap between cross-validation and full dataset metrics confirms that the model generalizes well without overfitting.
4.4.4.2 Understanding the metrics
In this section, we will obtain the equivalence of the metrics obtained in the transformed target for our original target.
We start by reobtaining the variable viol_crimes_per_popto our dataset:
# Reconstruct violent crime rate (original scale)
non_viol_per_pop <- inverse_boxcox(
datos_model$nonviol_trans,
lambda = 0.246
)And now, we inverse-transform the predictions made by the model, and calculate the metrics in our original “space”:
# Back-transform predictions to original scale
pred_nonviolent_original <- inverse_boxcox(
datos_model$pred_nonviol_trans,
lambda = 0.246
)
# Calculate performance metrics on ORIGINAL scale
rmse_nv_original <- sqrt(mean(
(non_viol_per_pop - pred_nonviolent_original)^2,
na.rm = TRUE
))
mae_nv_original <- mean(
abs(non_viol_per_pop - pred_nonviolent_original),
na.rm = TRUE
)
r2_nv_original <- cor(
non_viol_per_pop,
pred_nonviolent_original,
use = "complete.obs"
)^2
# Display metrics
cat("Performance Metrics (Original Scale):\n")Performance Metrics (Original Scale):
cat(" RMSE:", round(rmse_nv_original, 2), "crimes per 100,000 inhabitants\n") RMSE: 2075.61 crimes per 100,000 inhabitants
cat(" R²: ", round(r2_nv_original, 4), "\n\n") R²: 0.4552
cat("On average, the model's predictions deviate by ±",
round(mae_nv_original, 1),
"non-violent crimes\nper 100,000 inhabitants from the actual values.\n\n")On average, the model's predictions deviate by ± 1395 non-violent crimes
per 100,000 inhabitants from the actual values.
We can understand better now how our model works in the original dataset.
4.4.4.3 Prediction Intervals
We generate 95% prediction intervals for a random sample of communities.
set.seed(2025)
sample_indices_nv <- sample(1:nrow(datos_model), 10)
pred_intervals_nv <- predict(best_model_nv,
newdata = datos_model[sample_indices_nv, ],
interval = "prediction",
level = 0.95)
pred_table_nv <- data.frame(
Observation = sample_indices_nv,
Predicted = round(pred_intervals_nv[, "fit"], 3),
Lower_95 = round(pred_intervals_nv[, "lwr"], 3),
Upper_95 = round(pred_intervals_nv[, "upr"], 3),
Width = round(pred_intervals_nv[, "upr"] - pred_intervals_nv[, "lwr"], 3)
)
kable(pred_table_nv, caption = "Sample Prediction Intervals: Non-Violent Crime (Transformed Scale)")| Observation | Predicted | Lower_95 | Upper_95 | Width | |
|---|---|---|---|---|---|
| 909 | 909 | 25.825 | 19.692 | 31.959 | 12.268 |
| 460 | 460 | 26.363 | 20.227 | 32.499 | 12.273 |
| 932 | 932 | 28.098 | 21.961 | 34.236 | 12.275 |
| 279 | 279 | 27.965 | 21.831 | 34.100 | 12.269 |
| 1534 | 1534 | 28.295 | 22.149 | 34.440 | 12.291 |
| 187 | 187 | 27.044 | 20.911 | 33.177 | 12.266 |
| 1290 | 1290 | 26.220 | 20.085 | 32.355 | 12.271 |
| 461 | 461 | 23.313 | 17.171 | 29.456 | 12.285 |
| 972 | 972 | 34.560 | 28.417 | 40.702 | 12.286 |
| 1563 | 1563 | 29.724 | 23.590 | 35.857 | 12.268 |
The prediction intervals show a consistent width of approximately 12.27 units across the sample. This uniformity suggests that the model’s uncertainty is stable across different communities. However, the range is notably wider than the one observed for violent crimes, confirming that non-violent offenses are more volatile and harder to predict at an individual level.
While the model captures the general socio-economic trend, the large width indicates that individual community factors not included in the model (the noise) play a significant role in property-related crime.
4.4.4.4 Visualization: Predicted vs Actual
The scatter plot below visualizes the model’s accuracy. In a perfect model, all points would lie on the dashed diagonal line.
ggplot(datos_model, aes(x = nonviol_trans, y = pred_nonviol_trans)) +
geom_point(alpha = 0.75, color = "darkorange", size = 1.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed",
color = "black", linewidth = 1) +
labs(
title = "Non-Violent Crimes",
subtitle = paste("Model: NV4: Interaction"),
x = "Actual (Transformed Scale)",
y = "Predicted (Transformed Scale)"
) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))The Predicted vs. Actual plot shows a solid positive correlation, with the cloud of points following the dashed identity line. The visible dispersion reflects the \(R^2\) of 0.509, confirming that non-violent crimes are more volatile and contain more noise than violent ones. A notable detail is that the model seems to be more conservative at the extremes, as it recognizes these areas are high-crime, but it can’t quite capture just how high they actually are.
4.4.4.5 Coefficient Interpretation
The coefficients of our final model provide a detailed map of how socio-economic factors drive violent crime. Since the target variable is Box-Cox transformed (\(\lambda = 0.246\)), the magnitudes reflect changes on that transformed scale.
coef_summary_nv <- summary(best_model_nv)$coefficients
coef_table_nv <- data.frame(
Variable = rownames(coef_summary_nv),
Estimate = coef_summary_nv[, "Estimate"],
Std_Error = coef_summary_nv[, "Std. Error"],
t_value = coef_summary_nv[, "t value"],
p_value = coef_summary_nv[, "Pr(>|t|)"]
)
rownames(coef_table_nv) <- NULL
kable(coef_table_nv, digits = 4, caption = "Final Model Coefficients: Non-Violent Crime")| Variable | Estimate | Std_Error | t_value | p_value |
|---|---|---|---|---|
| (Intercept) | 22.2256 | 0.5275 | 42.1327 | 0.0000 |
| med_income | -0.0001 | 0.0000 | -5.0614 | 0.0000 |
| pop_dens | -0.0005 | 0.0001 | -5.8426 | 0.0000 |
| pct_pop_under_pov | 0.0192 | 0.0143 | 1.3357 | 0.1818 |
| family_risk_index | 0.5159 | 0.0200 | 25.8489 | 0.0000 |
| med_income:pop_dens | 0.0000 | 0.0000 | 5.2307 | 0.0000 |
Some insights and findings that can be obtained from the table:
Dominant Driver: Family Risk Index (Estimate: \(0.5159\), \(p < 0.001\)):As with violent crime, family instability remains the most powerful predictor. Interestingly, its impact is even larger in this model (\(0.5159\)) than in the violent crime model (\(0.3484\)), suggesting that the breakdown of social control within the family unit is a universal catalyst for both types of offenses in these communities.
The Interaction: Income × Density (Estimate: \(1.40 \times 10^{-8}\), \(p < 0.001\)): While the individual estimates for med_income and pop_dens are negative, their interaction is positive and highly significant. This suggests a synergistic effect aligned with Opportunity Theory.
It follow the formula
\[\frac{\partial \text{Non-Violent Crime}}{\partial \text{Income}} = -0.0001 + (1.401305 \times 10^{-8}) \times \text{Density}\]
The Poverty Paradox (
pct_pop_under_pov, \(p = 0.1818\)):Unlike the violent crime model where poverty was a fundamental catalyst, here it is not statistically significant at the 95% level. This suggests that for non-violent property crimes, the specific economic opportunity (represented by the income/density interaction) and social structure (family risk) are more direct predictors than the general poverty rate of the population.Urban Environment (
pop_dens, \(p < 0.001\)):The negative main effect for density, when interpreted alongside the positive interaction, suggests that density alone does not drive crime; rather, it acts as a multiplier for other risk factors like available wealth.
4.5 Comparative Analysis and Conclussions
After developing and evaluating models for both violent and non-violent crime, we now synthesize our findings to understand how these two crime types differ in their socioeconomic drivers, predictability, and policy implications.
4.5.1 Model Performance Comparison
# Consolidate final metrics for both targets
global_comparison <- data.frame(
Crime_Type = c("Violent Crime", "Non-Violent Crime"),
Selected_Model = c("V4: Interaction", "NV4: Interaction"),
# Number of predictors (subtracting 1 for the intercept)
N_Predictors = c(
length(attr(terms(formula_v4), "term.labels")),
length(attr(terms(formula_nv4), "term.labels"))
),
# R2 from the model fitted to the full dataset
Full_Dataset_R2 = c(
round(summary(best_model_v)$r.squared, 4),
round(summary(best_model_nv)$r.squared, 4)
),
# R2 and RMSE from Cross-Validation (using the tables created previously)
CV_Mean_R2 = c(
round(comparison_violent$Mean_R2[comparison_violent$Model == "V4: Interaction"], 4),
round(comparison_nonviolent$Mean_R2[comparison_nonviolent$Model == "NV4: Interaction"], 4)
),
CV_Mean_RMSE = c(
round(comparison_violent$Mean_RMSE[comparison_violent$Model == "V4: Interaction"], 4),
round(comparison_nonviolent$Mean_RMSE[comparison_nonviolent$Model == "NV4: Interaction"], 4)
)
)
# Render the table
knitr::kable(global_comparison,
col.names = c("Target", "Model", "Predictors", "Full R²", "CV Mean R²", "CV Mean RMSE"),
caption = "Final Model Performance: Violent vs Non-Violent Crime")| Target | Model | Predictors | Full R² | CV Mean R² | CV Mean RMSE |
|---|---|---|---|---|---|
| Violent Crime | V4: Interaction | 5 | 0.6049 | 0.5981 | 1.3274 |
| Non-Violent Crime | NV4: Interaction | 5 | 0.5093 | 0.5029 | 3.1231 |
Explanatory Power: Violent crime is more accurately explained by socioeconomic factors (\(R^2 \approx 0.60\)) than non-violent crime (\(R^2 \approx 0.51\)). This suggests property crimes are likely influenced by additional environmental or “opportunity” factors not captured in this dataset.
Parsimony: Both models use only 5 predictors to capture the bulk of the explainable variance. This validates our heuristic approach, proving that a small set of targeted variables is more efficient than the high-complexity 18-predictor model.
Stability: The minimal gap between CV metrics and Full Dataset results (difference \(< 0.01\) in \(R^2\)) confirms excellent model stability. Both models generalize well and are free from overfitting.
Error Scale: The significantly higher RMSE for non-violent crime (3.12 vs 1.32) reflects the higher frequency and greater inherent variance of property-related offenses compared to violent ones.
4.5.2 Criminological Implications. Theoretical Comparison
The distinct drivers in our models reflect two core criminological frameworks:
Violent Crime (Social Disorganization): The results highlight structural disruption. The family_risk_index is the primary engine of violence, while the negative interaction with poverty suggests a saturation effect; in extreme deprivation, poverty becomes the overwhelming driver, overshadowing other social factors.
Non-Violent Crime (Rotuine Activity Theory): Property crimes are driven by opportunity rather than just deprivation. The significance of the Income \(\times\) Density interaction proves that affluent, crowded urban areas provide a higher concentration of suitable targets, facilitating opportunistic theft regardless of general poverty levels.
Shared Drivers:
Family Risk Index: This is the most consistent and powerful predictor across both models. It confirms that the breakdown of family-based social control is the primary catalyst for all criminal activity.
Economic Strain: Poverty is a fundamental driver for both crime types. It creates social strain in violent crimes and provides the economic incentive for property-related offenses.
Physical Environment: Urban decay, such as abandoned buildings, acts as a universal signal of disorder. This neglect facilitates both violent and non-violent crime by lowering the perceived risk of detection.
5 General Linear Models
In the previous sections, we developed Linear Regression models to predict violent and non-violent crime rates as continuous outcomes. These models successfully explained 50-60% of the variance in crime levels and identified key socioeconomic and demographic drivers. However, OLS addresses only one facet of the crime prediction problem: quantifying how much crime occurs.
In practice, policymakers and law enforcement agencies often face a different question: which communities require priority intervention? Rather than precise numerical predictions, resource allocation decisions typically rely on risk stratification; identifying communities that exceed critical thresholds.
Generalized Linear Models (GLMs) allow us to reframe the prediction problem to address this complementary perspective. Specifically:
Logistic regression enables binary classification of communities into high-risk vs. low-risk categories based on an interpretable crime index. This directly supports policy decisions by answering: “Should this community be prioritized for intervention programs?”
5.1 GLM Data Preparation
For this section, we are going to consider the same 18 predictors as for Linear Regression. Also, the violent and non-violent variables will be used in its original version (no transformation), and a new target variable will be created for classification:
- Variable consistency: All GLMs use the identical predictor set from our OLS analysis to ensure direct comparability and avoid post-hoc specification searching. This design allows us to isolate the effect of modeling framework rather than confounding it with different variable selections.
#This was the final dataset that we were working on:
datos_glm <- datos_clean2 %>%
dplyr::select(-all_of(drop_names)) %>%
dplyr::select(-any_of("state"))This dataset contains the original target variables violent_crimes_per_popand non_viol_per_pep, and does not contain the transformed target variables viol_transand nonviol_trans.
- Target construction for Logistic Regression: we consolidate our two target variables into a single measure of community risk. We define a Weighted Crime Index that assigns a higher weight to violent offenses, reflecting their greater severity and impact on public safety. The index for each community \(i\) is calculated as follows:
\[CrimeIndex_i = 2 \cdot ViolentCrimes_i + NonViolentCrimes_i\]
This weighting reflects empirical evidence that violent offenses generate higher economic costs (medical expenses, lost productivity), greater psychological harm to communities, and stronger public demand for intervention than property crimes.
A new classification variable called risk will be created in terms of the metric crime_indexas if follows:
Communities are then classified as high risk if their crime index exceeds the 70th percentile threshold, and low risk in other situation.
This approach identifies the top 30% of communities requiring priority resource allocation, aligning with typical policy constraints on intervention capacity.
datos_glm <- datos_glm %>%
mutate(
crime_index = 2 * violent_crimes_per_pop + 1 * non_viol_per_pop,
# Binary classification: high-risk if above 75th percentile
risk = ifelse(crime_index >= quantile(crime_index, 0.7, na.rm = TRUE), 1, 0),
risk = factor(risk, levels = c(0, 1), labels = c("Low", "High"))
)We can see here that the variable was created as expected: mapping 0 to Low risk and 1 to High risk:
contrasts(datos_glm$risk) High
Low 0
High 1
High just means that these configuration represents the dummy variable: is the risk high?
5.1.1 Training Split
We will follow the same training split as in LR Models to ensure a fair comparison. So, 10-folds will be used in this section too.
First, let’s ensure reproducibility:
# Set seed for reproducibility
set.seed(2025)
# Define number of folds
k_folds <- 10
# Create fold assignments in the dataframe
datos_glm <- datos_glm %>%
mutate(fold_id = sample(rep(1:k_folds, length.out = n())))
cat("- Number of folds:", k_folds, "\n")- Number of folds: 10
cat("- Observations per fold:", round(nrow(datos_model)/k_folds), "\n")- Observations per fold: 190
In the same way as in LR, we will create a personalised function of cross-validation that will calculate and store interesting metrics that will be used to select the best model in the future.
This function distinguishes between logistic and Gamma models, and calculates different metrics for each of them (the first one is used in classification problems, whereas the second one is used in regression ones, just in case we decided to use them in the future).
cross_validate_glm <- function(formula, data, target_name, model_name, family_type = "gaussian", folds = 10) {
set.seed(2025)
metrics_list <- list()
for (fold in 1:folds) {
test_indices <- which(data$fold_id == fold)
train_data <- data[-test_indices, ]
test_data <- data[test_indices, ]
# Fit the Generalized Linear Model
model <- glm(formula, data = train_data, family = family_type)
# Generate predictions
predictions <- predict(model, newdata = test_data, type = "response")
actual <- test_data[[target_name]]
# Extract family name properly
if (inherits(family_type, "family")) {
family_name <- family_type$family
} else if (is.function(family_type)) {
family_name <- family_type()$family
} else {
family_name <- family_type
}
# Logistic Metrics
if (family_name == "binomial") {
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Convert factor to numeric if needed
actual_numeric <- as.numeric(actual) - 1 # Converts "Low"=0, "High"=1
accuracy <- mean(predicted_classes == actual_numeric, na.rm = TRUE)
roc_obj <- roc(actual_numeric, predictions, quiet = TRUE)
auc_val <- as.numeric(auc(roc_obj))
fold_metrics <- data.frame(
Fold = fold, Model = model_name, Family = "Logistic",
Accuracy = accuracy, AUC = auc_val
)
# Gamma metrics
} else if (family_name == "Gamma") {
mse <- mean((actual - predictions)^2, na.rm = TRUE)
rmse <- sqrt(mse)
mae <- mean(abs(actual - predictions), na.rm = TRUE)
# R-squared
ss_res <- sum((actual - predictions)^2, na.rm = TRUE)
ss_tot <- sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
r2 <- 1 - (ss_res / ss_tot)
fold_metrics <- data.frame(
Fold = fold, Model = model_name, Family = "Gamma",
RMSE = rmse, MAE = mae, R2 = r2
)
}
metrics_list[[fold]] <- fold_metrics
}
# Results consolidation
detailed_results <- bind_rows(metrics_list)
summary_stats <- detailed_results %>%
group_by(Model, Family) %>%
summarise(across(where(is.numeric) & !matches("Fold"),
list(Mean = ~mean(.x, na.rm = TRUE), SD = ~sd(.x, na.rm = TRUE)),
.names = "{.col}_{.fn}"),
.groups = "drop")
return(list(detailed = detailed_results, summary = summary_stats))
}Now, let’s start with the first GLM study and modelling.
5.2 Logistic Regression
Logistic regression models the probability of binary outcomes using the logistic (sigmoid) function. Unlike OLS, which assumes normally distributed errors, logistic regression is specifically designed for dichotomous dependent variables.
Let \(Y_i \in \{0, 1\}\) denote whether community \(i\) is classified as high-crime. We model the probability of a community being high-crime given a set of predictors \(\mathbf{X}_i\):
\[P(Y_i = 1 \mid \mathbf{X}_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})}}\]
Equivalently, we can express this using the logit link function, which maps the probability to the real line \((-\infty, \infty)\):
\[\log\left(\frac{P(Y_i = 1)}{1 - P(Y_i = 1)}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}\]
Where: * The left side represents the log-odds of a community being classified as high-crime. * \(\beta_0\) is the intercept (the log-odds when all predictors are zero). * The coefficients \(\beta_j\) quantify the change in the log-odds for a one-unit increase in predictor \(X_j\), holding all other variables constant.
Because log-odds are difficult to interpret directly, we transform the coefficients by exponentiating them to obtain Odds Ratios (OR):
\[OR_j = e^{\beta_j}\]
The Odds Ratio provides a multiplicative effect on the odds: * \(OR > 1\): The predictor increases the odds of being classified as high-crime. * \(OR < 1\): The predictor decreases the odds of being classified as high-crime. * \(OR = 1\): The predictor has no effect on the odds.
For example, \(\text{OR}_{\text{poverty}} = 1.15\), each percentage point increase in poverty multiplies the odds of high crime classification by 1.15 (a 15% increase).
5.2.1 Model Development
Four different candidate models will be explored in this section, each of them more complex than the prior. The idea is to choose a model complex enough to capture the relationships between our variables but not too much complex to capture the noise.
5.2.1.1 Model Logit 1: Univariate Predictor
Given that family_risk_index emerged as the strongest predictor in both LR models, we begin with a parsimonious univariate specification to assess its standalone discriminatory power:
# Define formula
formula_logit1 <- risk ~ family_risk_index
# Cross-validate for performance assessment
cv_logit1 <- cross_validate_glm(
formula = formula_logit1,
data = datos_glm,
target_name = "risk",
model_name = "Logit1: Single",
family_type = binomial(link = "logit"),
folds = 10
)
# Results
cat(" Mean Accuracy:", round(cv_logit1$summary$Accuracy_Mean, 3),
"±", round(cv_logit1$summary$Accuracy_SD, 3), "\n") Mean Accuracy: 0.834 ± 0.025
cat(" Mean AUC:", round(cv_logit1$summary$AUC_Mean, 3),
"±", round(cv_logit1$summary$AUC_SD, 3), "\n\n") Mean AUC: 0.896 ± 0.028
This baseline confirms that family structure instability alone captures substantial discriminatory power, validating its theoretical importance. However, a single-predictor model likely underfits the complexity of crime risk classification.
5.2.1.2 Model Logit 2: Top 3 Predictors
We expand to a minimal yet theoretically complete model capturing the three fundamental dimensions of community disadvantage: family structure, economic deprivation, and income levels. This parsimonious specification tests whether these three core variables can achieve competitive classification performance.
formula_logit2 <- risk ~ family_risk_index + pct_pop_under_pov + med_income
cv_logit2 <- cross_validate_glm(
formula = formula_logit2,
data = datos_glm,
target_name = "risk",
model_name = "Logit2: Top 3",
family_type = binomial(link = "logit"),
folds = 10
)
cat(" Mean Accuracy:", round(cv_logit2$summary$Accuracy_Mean, 3),
"±", round(cv_logit2$summary$Accuracy_SD, 3), "\n") Mean Accuracy: 0.836 ± 0.028
cat(" Mean AUC:", round(cv_logit2$summary$AUC_Mean, 3),
"±", round(cv_logit2$summary$AUC_SD, 3), "\n\n") Mean AUC: 0.898 ± 0.026
By combining family instability with both poverty rates and median income, this model captures complementary aspects of economic strain: poverty measures the proportion of households in deprivation, while median income reflects the overall community wealth level.
5.2.1.3 Model Logit 3: Balanced Five (5 Predictors)
We add two critical dimensions to the previous model: human capital (education) and environmental decay (vacant housing). This specification captures five distinct sociological mechanisms while maintaining interpretability.
# Balanced 5-predictor model covering distinct dimensions
formula_logit3 <- risk ~ family_risk_index + pct_pop_under_pov +
pct_unemployed + pct_not_hs_grad + pct_vacant_boarded
cv_logit3 <- cross_validate_glm(
formula = formula_logit3,
data = datos_glm,
target_name = "risk",
model_name = "Logit3: Balanced 5",
family_type = binomial(link = "logit"),
folds = 10
)
cat(" Mean Accuracy:", round(cv_logit3$summary$Accuracy_Mean, 3),
"±", round(cv_logit3$summary$Accuracy_SD, 3), "\n") Mean Accuracy: 0.834 ± 0.03
cat(" Mean AUC:", round(cv_logit3$summary$AUC_Mean, 3),
"±", round(cv_logit3$summary$AUC_SD, 3), "\n\n") Mean AUC: 0.897 ± 0.026
This model represents a carefully balanced specification:
Family structure (
family_risk_index): Social disorganizationEconomic strain (
pct_pop_under_pov): Material deprivationLabor market exclusion (
pct_unemployed): Employment opportunityHuman capital deficit (
pct_not_hs_grad): Educational attainmentPhysical decay (
pct_vacant_boarded): Environmental disorder
Each predictor captures a conceptually distinct pathway to crime, ensuring we avoid redundancy while maximizing explanatory breadth.
5.2.1.4 Model Logit 4: Theory-Driven Model
A comprehensive model grounded in Social Disorganization Theory and Strain Theory. We incorporate demographic composition and population density alongside the five core predictors, providing a fuller picture of community risk.
# Theory-driven model with 9 key predictors
formula_logit4 <- risk ~ family_risk_index + pct_pop_under_pov +
med_income + pct_unemployed + pct_not_hs_grad + pct_b_sor_more +
pct_vacant_boarded + pop_dens + race_pct_white
cv_logit4 <- cross_validate_glm(
formula = formula_logit4,
data = datos_glm,
target_name = "risk",
model_name = "Logit4: Theory",
family_type = binomial(link = "logit"),
folds = 10
)
cat(" Mean Accuracy:", round(cv_logit4$summary$Accuracy_Mean, 3),
"±", round(cv_logit4$summary$Accuracy_SD, 3), "\n") Mean Accuracy: 0.839 ± 0.03
cat(" Mean AUC:", round(cv_logit4$summary$AUC_Mean, 3),
"±", round(cv_logit4$summary$AUC_SD, 3), "\n\n") Mean AUC: 0.901 ± 0.027
This model balances explanatory power with interpretability. All predictors have clear sociological justification and capture different dimensions of community risk: family structure, economic strain, human capital, environmental decay, population density, and demographic composition.
5.2.1.5 Model Logit 5: Full Model (All predictors)
Finally, we fit the complete specification with all 18 predictors to establish the performance ceiling and assess whether additional variables improve classification accuracy beyond the theory-driven specification.
# Full model with all 18 predictors
formula_logit5 <- risk ~ population + racepctblack + race_pct_white +
race_pct_hisp + med_income + pct_pop_under_pov + pct_not_hs_grad +
pct_b_sor_more + pct_unemployed + pct_immig_recent + pct_not_speak_engl_well +
pct_larg_house_fam + pct_hous_occup + pct_hous_own_occ + pct_vacant_boarded +
rent_median + pop_dens + family_risk_index
cv_logit5 <- cross_validate_glm(
formula = formula_logit5,
data = datos_glm,
target_name = "risk",
model_name = "Logit5: Full",
family_type = binomial(link = "logit"),
folds = 10
)
cat(" Mean Accuracy:", round(cv_logit5$summary$Accuracy_Mean, 3),
"±", round(cv_logit5$summary$Accuracy_SD, 3), "\n") Mean Accuracy: 0.846 ± 0.025
cat(" Mean AUC:", round(cv_logit5$summary$AUC_Mean, 3),
"±", round(cv_logit5$summary$AUC_SD, 3), "\n\n") Mean AUC: 0.908 ± 0.026
Maximum complexity benchmark. This model establishes whether the full predictor set yields meaningful gains over more parsimonious specifications, or whether simpler models capture most discriminatory signal. The full model may suffer from multicollinearity and reduced interpretability, making it less suitable for policy guidance despite potential performance improvements.
5.2.2 Logistic Model Comparison
Let’s choose the best model for classifying high-risk communities:
# Consolidate CV summaries
comparison_logit <- bind_rows(
cv_logit1$summary %>% mutate(N_Predictors = 1),
cv_logit2$summary %>% mutate(N_Predictors = 3),
cv_logit3$summary %>% mutate(N_Predictors = 5),
cv_logit4$summary %>% mutate(N_Predictors = 9),
cv_logit5$summary %>% mutate(N_Predictors = 18)
) %>%
dplyr::select(Model, N_Predictors, Accuracy_Mean, Accuracy_SD, AUC_Mean, AUC_SD) %>%
arrange(desc(AUC_Mean))
kable(comparison_logit, digits = 4,
caption = "Logistic Regression: Cross-Validation Performance Comparison")| Model | N_Predictors | Accuracy_Mean | Accuracy_SD | AUC_Mean | AUC_SD |
|---|---|---|---|---|---|
| Logit5: Full | 18 | 0.8459 | 0.0247 | 0.9083 | 0.0258 |
| Logit4: Theory | 9 | 0.8386 | 0.0299 | 0.9014 | 0.0273 |
| Logit2: Top 3 | 3 | 0.8365 | 0.0282 | 0.8976 | 0.0256 |
| Logit3: Balanced 5 | 5 | 0.8344 | 0.0302 | 0.8974 | 0.0264 |
| Logit1: Single | 1 | 0.8338 | 0.0248 | 0.8956 | 0.0281 |
5.2.2.1 Visual Comparison of Models
# Create comparison plots
p1 <- ggplot(comparison_logit, aes(x = reorder(Model, AUC_Mean), y = AUC_Mean)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_errorbar(aes(ymin = AUC_Mean - AUC_SD, ymax = AUC_Mean + AUC_SD),
width = 0.2) +
coord_flip() +
labs(title = "Model Comparison: AUC", x = "", y = "Mean AUC (± SD)") +
theme_classic()
p2 <- ggplot(comparison_logit, aes(x = reorder(Model, Accuracy_Mean), y = Accuracy_Mean)) +
geom_col(fill = "darkgreen", alpha = 0.8) +
geom_errorbar(aes(ymin = Accuracy_Mean - Accuracy_SD, ymax = Accuracy_Mean + Accuracy_SD),
width = 0.2) +
coord_flip() +
labs(title = "Model Comparison: Accuracy", x = "", y = "Mean Accuracy (± SD)") +
theme_classic()
p2 + p1 + plot_annotation(title = "Cross-Validation Performance: All Logistic Models")5.2.2.2 Best Model Selection
Based on the graphs and the complexity of the models, we will choose Model Logit 2 as our final model. This decision is based in two main observations:
Model 2 is the best performing model out of the simplest models.
Model 2 is the least complex model out of the best performing ones.
So this model strikes the optimal balance between discriminatory power and interpretability. With only three predictors (family instability, poverty, and median income) it achieves classification performance equivalent to far more complex specifications. These three variables together provide complementary perspectives on community risk while remaining easily interpretable.
# Assign final model
best_formula_logit <- formula_logit2
best_model_name <- "Logit2: Top 3"print(best_formula_logit)risk ~ family_risk_index + pct_pop_under_pov + med_income
# Fit on entire dataset
best_model_logit <- glm(best_formula_logit, data = datos_glm, family = binomial(link = "logit"))5.2.3 Model Diagnostics
Now, we must verify that the results are not misleading. Unlike linear regression, logistic regression does not require normality or homoscedasticity of residuals. However, we still need to check for multicollinearity and influential observations to ensure the model’s stability and reliability.
5.2.3.1 Multicollinearity
We use the Variance Inflation Factor (VIF) to detect this issue. For logistic regression, VIF > 5 is generally considered problematic.
# Calculate VIF
vif_values_logit <- vif(best_model_logit)
vif_table_logit <- data.frame(
Variable = names(vif_values_logit),
VIF = round(vif_values_logit, 4)
) %>%
arrange(desc(VIF))
kable(vif_table_logit, row.names = FALSE, caption = "VIF Analysis: Multicollinearity Check")| Variable | VIF |
|---|---|
| med_income | 2.6021 |
| pct_pop_under_pov | 2.3656 |
| family_risk_index | 1.4441 |
max_vif_logit <- max(vif_values_logit)
cat("\nMaximum VIF:", round(max_vif_logit, 2), "\n")
Maximum VIF: 2.6
No variables exceed the VIF threshold of 5. Multicollinearity is not a concern.
5.2.3.2 Influencial Observations
Identifying influential observations is critical to ensure the model is not overly driven by a few extreme cases. We use Cook’s Distance to detect communities whose exclusion would substantially alter the estimated coefficients.
# Calculate Cook's Distance
cooks_d_logit <- cooks.distance(best_model_logit)
influential_threshold <- 4 / nrow(datos_glm)
influential_logit <- which(cooks_d_logit > influential_threshold)
influential_pct <- 100 * length(influential_logit) / nrow(datos_glm)
cat("Influential observations (Cook's D > 4/n):", length(influential_logit), "\n")Influential observations (Cook's D > 4/n): 109
cat("Percentage of dataset:", round(influential_pct, 2), "%\n\n")Percentage of dataset: 5.73 %
# Plot Cook's Distance
par(mfrow = c(1, 2))
plot(best_model_logit, which = 4, main = "Cook's Distance", sub.caption = "")
plot(best_model_logit, which = 5, main = "Residuals vs Leverage", sub.caption = "")par(mfrow = c(1, 1))As before, qualitative review confirms these are not data entry errors but rather legitimate extreme communities.
From a policy perspective, these high-risk communities are precisely the ones we need the model to identify correctly.
Removing them would create a “sanitized” model that performs well on typical cases but fails where intervention is most needed: in communities at the distributional extremes. Therefore, we retain all observations to ensure the model remains representative of the full spectrum of urban crime dynamics.
5.2.4 Predictive Performance
To conclude our logistic regression analysis, we evaluate how the selected model performs on the entire dataset and compare these results with our cross-validation estimates to detect any potential overfitting.
5.2.4.1 Performance on Full Dataset
datos_glm$pred_prob <- predict(best_model_logit, newdata = datos_glm, type = "response")
# Generate class predictions (using 0.5 threshold)
datos_glm$pred_class <- factor(
ifelse(datos_glm$pred_prob >= 0.5, "High", "Low"),
levels = c("Low", "High")
)
# Calculate confusion matrix
conf_matrix_full <- confusionMatrix(
data = datos_glm$pred_class,
reference = datos_glm$risk,
positive = "High"
)
# ROC and AUC
roc_full <- roc(datos_glm$risk, datos_glm$pred_prob, levels = c("Low", "High"), quiet = TRUE)
auc_full <- as.numeric(auc(roc_full))
cat(" Accuracy:", round(conf_matrix_full$overall["Accuracy"], 4), "\n") Accuracy: 0.8375
cat(" AUC:", round(auc_full, 4), "\n") AUC: 0.8979
cat(" Sensitivity:", round(conf_matrix_full$byClass["Sensitivity"], 4), "\n") Sensitivity: 0.6445
cat(" Specificity:", round(conf_matrix_full$byClass["Specificity"], 4), "\n") Specificity: 0.9204
cat(" Precision:", round(conf_matrix_full$byClass["Precision"], 4), "\n\n") Precision: 0.7764
# Comparison table
comparison_metrics_logit <- data.frame(
Metric = c("AUC", "Accuracy"),
Cross_Validation = c(
round(cv_logit2$summary$AUC_Mean, 4),
round(cv_logit2$summary$Accuracy_Mean, 4)
),
Full_Dataset = c(
round(auc_full, 4),
round(conf_matrix_full$overall["Accuracy"], 4)
),
Difference = c(
round(auc_full - cv_logit2$summary$AUC_Mean, 4),
round(conf_matrix_full$overall["Accuracy"] - cv_logit2$summary$Accuracy_Mean, 4)
)
)
kable(comparison_metrics_logit,
col.names = c("Metric", "CV (Mean)", "Full Dataset", "Difference"),
caption = "Performance Consistency Check")| Metric | CV (Mean) | Full Dataset | Difference | |
|---|---|---|---|---|
| AUC | 0.8976 | 0.8979 | 0.0004 | |
| Accuracy | Accuracy | 0.8365 | 0.8375 | 0.0011 |
The minimal gap between CV and full dataset performance confirms that the model generalizes well and is not overfitting. This indicates that the model can reliably classify unseen communities with stable performance across different data partitions. It demonstrates that the model’s discriminatory power is not an artifact of a favorable train-test split but rather reflects genuine patterns in the relationship between socioeconomic conditions and crime risk.
5.2.4.2 Confusion Matrix Analysis
The confusion matrix provides a detailed breakdown of classification performance.
conf_matrix_2x2 <- conf_matrix_full$table
kable(conf_matrix_2x2,
caption = "Confusion Matrix")| Low | High | |
|---|---|---|
| Low | 1225 | 203 |
| High | 106 | 368 |
From a resource allocation standpoint, False Negatives are more costly than False Positives. A False Negative means failing to identify a truly high-risk community, potentially allowing crime to escalate unchecked. In contrast, a False Positive results in allocating resources to a lower-risk area—inefficient but not dangerous.
5.2.4.3 ROC Curve and Optimal Threshold
The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between Sensitivity (correctly identifying high-risk communities) and Specificity (correctly identifying low-risk communities) across all possible classification thresholds.
# Plot ROC curve
ggroc(roc_full, size = 1.2, color = "steelblue") +
geom_abline(intercept = 1, slope = 1, linetype = "dashed", color = "gray50", size = 0.8) +
annotate("text", x = 0.25, y = 0.25,
label = paste("AUC =", round(auc_full, 3)),
size = 6, fontface = "bold") +
labs(
title = "ROC Curve",
x = "1 - FPR",
y = "TPR"
) +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))5.2.4.4 Threshold Selection
While we use the default threshold of 0.5 for balanced classification, it could be adjusted based on resource constraints and risk tolerance:
Lower threshold: Increases Sensitivity but reduces Specificity. Identifies more high-risk communities at the cost of more false alarms. Appropriate when intervention resources are abundant and the cost of missing high-risk areas is high
Higher threshold: Increases Specificity but reduces Sensitivity. Fewer false alarms but risks missing truly high-risk areas. Appropriate when resources are scarce and must be concentrated on the most certain cases
# Calculate optimal threshold using Youden's Index (maximizes Sensitivity + Specificity - 1)
coords_roc <- coords(roc_full, "best", ret = c("threshold", "sensitivity", "specificity"),
best.method = "youden")
cat("Optimal threshold (Youden's Index):", round(coords_roc$threshold, 3), "\n")Optimal threshold (Youden's Index): 0.275
cat(" Sensitivity:", round(coords_roc$sensitivity, 3), "\n") Sensitivity: 0.842
cat(" Specificity:", round(coords_roc$specificity, 3), "\n\n") Specificity: 0.796
This threshold would be efficient if the consequence of missing on a high-risk zone is bigger than the cpnsequence of sending resources to a low-risk zone.
Let’s see how well the different thresholds sepparate high risk from low risk communities by plotting predicted probabilities colored by actual risk levels:
optimal_threshold <- 0.275
threshold_df <- data.frame(
threshold = c(0.5, optimal_threshold),
type = c("Default threshold", "Optimal threshold")
)
ggplot(datos_glm, aes(x = seq_along(pred_prob), y = pred_prob)) +
# First color scale: points
geom_point(aes(color = risk), alpha = 0.6, size = 1.5) +
scale_color_manual(
values = c(
"Low" = "green2",
"High" = "red2"
),
name = "Actual Risk"
) +
# Reset color scale
new_scale_color() +
# Second color scale: threshold lines
geom_hline(
data = threshold_df,
aes(yintercept = threshold, color = type, linetype = type),
size = 1
) +
scale_color_manual(
values = c(
"Default threshold" = "black",
"Optimal threshold" = "blue"
),
name = ""
) +
scale_linetype_manual(
values = c(
"Default threshold" = "dashed",
"Optimal threshold" = "solid"
),
name = ""
) +
labs(
title = "Predicted Probabilities by Risk Level",
subtitle = "Comparison of both classification thresholds",
x = "Community Index",
y = "Predicted Prob of High Risk"
) +
theme_minimal(base_size = 10) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "top"
)Communities near the thresholds represent uncertain cases where socioeconomic indicators do not clearly signal high or low risk. These may be communities in transition or with mixed risk factors, and should be studied independently when doing a new prediction.
5.2.4.5 Coefficient Interpretation
The coefficients of a logistic regression are expressed on the log-odds scale, which is not directly interpretable. By exponentiating them, we obtain Odds Ratios (OR), which quantify the multiplicative effect of a one-unit increase in the predictor on the odds of high-risk classification.
or_table_clean <- tidy(best_model_logit, conf.int = TRUE, exponentiate = TRUE) %>%
filter(term != "(Intercept)") %>%
dplyr::select(term, estimate, conf.low, conf.high, p.value) %>%
rename(
Predictor = term,
`Odds Ratio` = estimate,
`CI Lower` = conf.low,
`CI Upper` = conf.high,
`p-value` = p.value
)
kable(or_table_clean, digits = c(0, 4, 4, 4, 4),
caption = "Odds Ratios for High-Risk Classification")| Predictor | Odds Ratio | CI Lower | CI Upper | p-value |
|---|---|---|---|---|
| family_risk_index | 1.4824 | 1.4176 | 1.5539 | 0.0000 |
| pct_pop_under_pov | 1.0439 | 1.0165 | 1.0717 | 0.0014 |
| med_income | 1.0000 | 1.0000 | 1.0000 | 0.0248 |
Let’s analyze one by one and interpretate them:
# Extract ORs for interpretation
or_family <- or_table_clean$`Odds Ratio`[or_table_clean$Predictor == "family_risk_index"]
or_poverty <- or_table_clean$`Odds Ratio`[or_table_clean$Predictor == "pct_pop_under_pov"]
or_income <- or_table_clean$`Odds Ratio`[or_table_clean$Predictor == "med_income"]- Family Risk Index
cat(" Family Risk Index OR =", round(or_family,2)) Family Risk Index OR = 1.48
Interpretation: Each unit increase in the family risk index multiplies the odds of being classified as high-risk by 1.48
- Poverty
cat(" Family Risk Index OR =", round(or_poverty,2)) Family Risk Index OR = 1.04
Interpretation: Each 1% increase in poverty multiplies the odds of being classified as high-risk in 1.04
For example, if two communities are identical but one has a 10% poverty rate and the other one a 30%, this second community has \((1.04)^{20} = 2.19\) times the odds of being classified as risky, i.e. \(119%\) more chance of being classified as high risk.
- Median Income
Surprisingly, the model associates risk with income in a direct relation.
cat(" Median Income OR =", or_income, "\n") Median Income OR = 1.000025
Interpretation: Each $1 increase in median income multiplies the odds by 1.000025.
This number seems small, but if wetwo identical commmunities differ in \(\$10000\) in median income, the difference in odds is \(1.000025^{10000} = 1.284\)
Let’s predict the probability for an hypothetical community with some specific characteristics:
# Define example community
example_community <- data.frame(
family_risk_index = 14,
pct_pop_under_pov = 50,
med_income = 15000
)
# Get model coefficients
coefs <- coef(best_model_logit)
intercept <- coefs["(Intercept)"]
beta_family <- coefs["family_risk_index"]
beta_poverty <- coefs["pct_pop_under_pov"]
beta_income <- coefs["med_income"]
# Calculate log-odds
log_odds <- intercept +
beta_family * example_community$family_risk_index +
beta_poverty * example_community$pct_pop_under_pov +
beta_income * example_community$med_income
# Calculate probability
prob_high <- plogis(log_odds)
# Calculate odds
odds_high <- exp(log_odds)
cat(" Family Risk Index:", example_community$family_risk_index, "\n") Family Risk Index: 14
cat(" Poverty Rate:", example_community$pct_pop_under_pov, "%\n") Poverty Rate: 50 %
cat(" Median Income: $", format(example_community$med_income, big.mark = ","), "\n\n") Median Income: $ 15,000
cat(" Log-odds =", round(intercept, 4),
"+", round(beta_family, 4), "×", example_community$family_risk_index,
"+", round(beta_poverty, 4), "×", example_community$pct_pop_under_pov,
"+", format(beta_income, scientific = FALSE), "×", example_community$med_income, "=", round(log_odds, 4), "\n\n") Log-odds = -8.325 + 0.3937 × 14 + 0.043 × 50 + 0.00002532434 × 15000 = -0.284
cat(" Odds of High Risk =", round(odds_high, 4), "\n") Odds of High Risk = 0.7528
cat(" Probability of High Risk =", round(prob_high, 4), "\n\n") Probability of High Risk = 0.4295
cat("Classification: This community would be classified as",
ifelse(prob_high >= optimal_threshold, "HIGH RISK", "LOW RISK"), "\n")Classification: This community would be classified as HIGH RISK
Notice that the model would classify this as high or low risk depending if we chose the usual threshold (0.5) or the optimal threshold (0.275).
This section extended the linear regression framework by adopting Generalized Linear Models (GLMs) to address complementary policy-relevant questions in crime analysis. While Ordinary Least Squares focused on predicting how much crime occurs, GLMs enabled us to model risk classification and distributionally appropriate crime intensities, thereby broadening both the interpretability and applicability of the results.
6 Conclusions
This study developed a comprehensive, theory-driven statistical analysis of community-level crime, combining rigorous data preparation, exploratory analysis, and multiple regression frameworks to address both crime intensity and crime risk classification. By progressively increasing model complexity and adapting the modeling strategy to the nature of each research question, the analysis provides robust, interpretable, and policy-relevant insights.
- Data Preparation
The data preparation phase was a critical component of the analysis due to the dimensionality, heterogeneity, and statistical complexity of the dataset. Crime-related variables exhibited strong skewness, socioeconomic predictors were measured on very different scales, and several variables were highly correlated. Addressing these issues required careful variable selection, the removal of identifiers, consistency checks across modeling stages, and the construction of composite indicators that captured key social dimensions while reducing redundancy.
This preprocessing effort ensured that subsequent analyses were both statistically valid and substantively interpretable. Rather than being treated as a mechanical step, data preparation directly shaped the modeling strategy by enabling fair comparisons across models and preventing specification-driven bias. Without this structured preprocessing pipeline, the inferential results obtained later would have been unreliable.
- Exploratory Data Analysis
Exploratory Data Analysis revealed strong structural patterns in the data that guided all subsequent modeling decisions. Both violent and non-violent crime rates displayed pronounced right-skewness and heteroskedasticity, indicating that naïve linear modeling on the original scale would violate key assumptions. Clear monotonic relationships were observed between crime rates and indicators of socioeconomic disadvantage, particularly poverty, unemployment, family instability, and educational attainment.
EDA also highlighted substantial correlations among predictors, reinforcing the need for parsimonious and theoretically grounded model specifications. Importantly, this phase served not only a descriptive purpose but also an inferential one, informing transformation choices, variable prioritization, and the decision to model violent and non-violent crime separately.
- Linear Regression
Linear regression models were developed to explain the intensity of violent and non-violent crime as continuous outcomes. After applying appropriate transformations and diagnostic corrections, the final models achieved moderate to strong explanatory power, accounting for approximately 50–60% of the observed variance. This level of performance is notable given the multifactorial and socially complex nature of crime phenomena.
Across both crime types, a consistent set of predictors emerged as statistically and substantively significant, with family structure instability and economic deprivation playing central roles. While effect magnitudes differed between violent and non-violent crime, the underlying structural drivers were largely shared, suggesting common social mechanisms. Linear regression thus provided a robust explanatory baseline and clarified how community characteristics translate into differing levels of crime intensity.
- General Linear Models
Logistic regression reframed the analysis from predicting crime levels to classifying communities according to their risk status, directly addressing policy-relevant questions related to prioritization and resource allocation. Using a theoretically motivated crime index and cross-validated model selection, the analysis demonstrated that relatively simple specifications achieved strong and stable discriminatory performance. In particular, parsimonious models captured most of the classification signal, with only marginal gains from increased complexity.
The interpretability of logistic regression proved to be a key advantage. Estimated odds ratios allowed for a transparent assessment of how changes in socioeconomic conditions affect the likelihood of a community being classified as high risk. Moreover, the stability of performance across folds indicated that the classification results were robust and generalizable, making logistic regression a practical complement to linear models rather than a substitute.
- Overall Synthesis
Taken as a whole, the analysis shows that crime can be understood both as a continuous outcome and as a discrete risk process, and that different modeling frameworks are required to address each perspective. The convergence of findings across data preparation, EDA, linear regression, and logistic regression strengthens confidence in the identified drivers of crime, particularly the central role of family instability and economic disadvantage.
Ultimately, this work demonstrates that careful preprocessing, theory-informed variable selection, and appropriate statistical modeling are as important as predictive performance itself. By integrating explanatory and classification approaches within a coherent analytical pipeline, the study provides both substantive insight into crime dynamics and a defensible basis for policy-relevant decision-making.
7 References
In this section, some articles and pages that were used and consulted throughout the process will be shown, to:
- Give credit to the author of certain ideas.
- Show the veracity of certain arguments used in the project.